
This article is actually much narrower in scope than the title would seem to suggest. If you don't let the first sentence sink in, it barely even makes sense.

> Here at Factual we apply machine learning techniques to help us build high quality data sets out of the gnarly mass of data that we gather from everywhere we can find it.

So really, this blog post deals with the topic of "principles for applying machine learning techniques to data cleaning".

Clearly this is an appropriate topic for this company, as their product is essentially API access to pre-cleaned/curated data sets. However, the post itself lacks depth. My work involves a lot of time cleaning my company's internal data sets, so I definitely recognize the pain of the corner/boundary/special case. But anyone who has worked with data cleaning (read: everyone who has worked with data) already knows this pain; they don't need a blog post to point it out.

I would, however, be interested in knowing what sorts of machine learning techniques they're applying to the problem. When I clean data, the process is largely manual, probably in part because I'm not working with data sets as large as theirs. Maybe they don't want to reveal their secret sauce, but I think a more technical blog post could highlight how good their data cleaning is, and therefore how high-quality their product is.


