
I wonder if you could do machine learning on schemata. Basically, start learning about dates (as an example) and update the stored information as the system learns. Something that has one person putting in { "name": "foo", "born": "10/1/92" } and someone else putting in { "name": "bar", "born": "september 30th, 1966" }, and then goes back and replaces the dates with an ISO standard date type, but with a change history so you could look backwards in time at the data and see how the database had "improved" it (or not). Then, by voting on the improvements, you teach the system to clean up its data representations. Crazy? Insightful? Stupid? I don't know, but it was the question that popped into my head.
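The "normalize but keep a change history" idea could be sketched like this. This is a hypothetical data structure (the names `TrackedField` and `improve` are my own), not anything the comment specifies:

```python
from dataclasses import dataclass, field

@dataclass
class TrackedField:
    """A field value plus the history of 'improvements' applied to it."""
    value: str
    history: list = field(default_factory=list)  # (old_value, note) pairs

    def improve(self, new_value: str, note: str) -> None:
        # Keep the raw input so we can look backwards in time at the data.
        self.history.append((self.value, note))
        self.value = new_value

born = TrackedField("september 30th, 1966")
born.improve("1966-09-30", "normalized to ISO 8601")
# born.value is now "1966-09-30"; born.history still holds the original text,
# so a bad "improvement" can be reviewed and voted down later.
```

Storing the history alongside the value is what makes the voting step possible: a rejected normalization can simply be rolled back to the recorded original.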


The problem is that many possible formats conflict in ways that make resolution impossible without cross-referencing other sources.

Which date is "10/3/5"? Is it March 5th 2010? March 5th 1910? March 10th 1905? March 10th 2005? October 3rd 1905? October 3rd 2005? (Or another century entirely, though the 20th and 21st would be most likely.) And don't assume that "/" vs. "-" as a separator is sufficient to tell them apart.
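You can see how bad the ambiguity is just by enumerating the readings. A small sketch (assuming only that the two-digit year means 19xx or 20xx, as the comment does):

```python
from datetime import date
from itertools import permutations

def candidate_dates(a: int, b: int, c: int) -> list:
    """Enumerate every valid (year, month, day) reading of three small
    numbers, trying both 19xx and 20xx for the two-digit year."""
    results = set()
    for y, m, d in permutations((a, b, c)):
        for year in (1900 + y, 2000 + y):
            try:
                results.add(date(year, m, d))
            except ValueError:
                pass  # skip readings like month 13 or day 32
    return sorted(results)

print(len(candidate_dates(10, 3, 5)))  # → 12 distinct plausible dates
```

Twelve valid calendar dates from one six-character string, with nothing in the string itself to pick between them.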

And you'll find a lot of other variations - I'm used to writing 10/3-5, for example... But I'm not even consistent: I might write 10/3/5 or 10-3-5, or 5/3/10 or 5-3-10. Anywhere I want to be explicit, I write 2005-03-10, exactly because I'm used to seeing so many ambiguous dates that can't easily be resolved.

What about the value 5.123? Is it a floating-point value with "123" after the decimal point, or the integer 5123? The decimal marker is "," in many countries, and the thousands separator is usually, but not always, "." in countries that use "," as the decimal marker. If you treat things as "just text" you potentially have to deal with dozens of different combinations of decimal points and grouping markers (and depending on the country, grouping marks don't all occur every 3 digits to the left of the decimal marker...).
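The same string really does parse to two different numbers depending on the assumed convention. A minimal sketch (`parse_number` is a hypothetical helper; it handles only the simple every-3-digits grouping, which as noted above is not universal):

```python
def parse_number(text: str, decimal_sep: str) -> float:
    """Parse a numeric string under a given decimal-separator convention.
    The other marker is treated as a thousands separator and dropped."""
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

parse_number("5.123", decimal_sep=".")  # → 5.123
parse_number("5.123", decimal_sep=",")  # → 5123.0
```

Identical input, a factor-of-a-thousand difference in the result, and no way to choose without knowing the writer's locale.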

Interpreting small text fragments is fraught with a near-infinite number of obnoxious details like this, and part of the problem is that few people know even most of them, and they will be unable to quickly resolve ambiguities without cross-referencing other data (or worse: they think they know, or don't even recognize that there is an ambiguity in the first place).


Sounds plausible to me... like how OpenStreetMap improved (improves?) its map data by letting people import traces from their GPS devices.

http://www.openstreetmap.org/traces


See NELL, the Never-Ending Language Learner.



