
I wonder if you could do machine learning on schemata. Basically, start learning about dates (as an example) and update the stored information as the system learns. Something that has one person putting in { "name": "foo", "born": "10/1/92" } and someone else putting in { "name": "bar", "born": "september 30th, 1966" }, and then goes back and replaces the dates with an ISO standard date type, but with a change history so you could look backwards in time at the data and see how the database had "improved" it (or not). Then, by voting on the improvements, you teach the system to clean up its data representations. Crazy? Insightful? Stupid? I don't know, but it was the question that popped into my head.
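The "normalize but keep a change history" idea could be sketched like this. This is a hypothetical data structure (the names `TrackedField` and `improve` are my own), not anything the comment specifies:

```python
from dataclasses import dataclass, field

@dataclass
class TrackedField:
    """A field value plus the history of 'improvements' applied to it."""
    value: str
    history: list = field(default_factory=list)  # (old_value, note) pairs

    def improve(self, new_value: str, note: str) -> None:
        # Keep the raw input so we can look backwards in time at the data.
        self.history.append((self.value, note))
        self.value = new_value

born = TrackedField("september 30th, 1966")
born.improve("1966-09-30", "normalized to ISO 8601")
# born.value is now "1966-09-30"; born.history still holds the original text,
# so a bad "improvement" can be reviewed and voted down later.
```

Storing the history alongside the value is what makes the voting step possible: a rejected normalization can simply be rolled back to the recorded original.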


The problem is that many possible formats conflict in ways that make resolution impossible without cross-referencing other sources.

Which date is "10/3/5"? Is it March 5th 2010? March 5th 1910? March 10th 1905? March 10th 2005? October 3rd 1905? October 3rd 2005? (Or another century entirely, though the 20th and 21st would be most likely.) And don't assume that "/" vs. "-" as a separator is sufficient to tell them apart.
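You can see how bad the ambiguity is just by enumerating the readings. A small sketch (assuming only that the two-digit year means 19xx or 20xx, as the comment does):

```python
from datetime import date
from itertools import permutations

def candidate_dates(a: int, b: int, c: int) -> list:
    """Enumerate every valid (year, month, day) reading of three small
    numbers, trying both 19xx and 20xx for the two-digit year."""
    results = set()
    for y, m, d in permutations((a, b, c)):
        for year in (1900 + y, 2000 + y):
            try:
                results.add(date(year, m, d))
            except ValueError:
                pass  # skip readings like month 13 or day 32
    return sorted(results)

print(len(candidate_dates(10, 3, 5)))  # → 12 distinct plausible dates
```

Twelve valid calendar dates from one six-character string, with nothing in the string itself to pick between them.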

And you'll find a lot of other variations - I'm used to writing 10/3-5, for example... But I'm not even consistent: I might write 10/3/5 or 10-3-5, or 5/3/10 or 5-3-10. Anywhere I want to be explicit, I write 2005-03-10, exactly because I'm used to seeing so many ambiguous dates that can't easily be resolved.

What about the value 5.123? Is it a floating-point value with "123" after the decimal point, or the integer 5123? The decimal marker is "," in many countries, and the thousands separator is usually, but not always, "." in countries that use "," as the decimal marker. If you treat things as "just text" you potentially have to deal with dozens of different combinations of decimal points and grouping markers (and depending on the country, grouping marks don't all occur every 3 digits to the left of the decimal marker...).
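The same string really does parse to two different numbers depending on the assumed convention. A minimal sketch (`parse_number` is a hypothetical helper; it handles only the simple every-3-digits grouping, which as noted above is not universal):

```python
def parse_number(text: str, decimal_sep: str) -> float:
    """Parse a numeric string under a given decimal-separator convention.
    The other marker is treated as a thousands separator and dropped."""
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

parse_number("5.123", decimal_sep=".")  # → 5.123
parse_number("5.123", decimal_sep=",")  # → 5123.0
```

Identical input, a factor-of-a-thousand difference in the result, and no way to choose without knowing the writer's locale.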

Interpreting small text fragments is fraught with a near-infinite number of obnoxious details like this, and part of the problem is that few people know even most of them, and they will be unable to quickly resolve ambiguities without cross-referencing other data (or worse: they think they know, or don't even recognize that there is an ambiguity in the first place).


Sounds plausible to me... like how OpenStreetMap improved (improves?) its map data by letting people import traces from their GPS devices.

http://www.openstreetmap.org/traces


See NELL, the Never-Ending Language Learner.



