Which means the corpus is broken. But regular people rarely care about correct spelling, in my experience, and so I doubt corpus maintainers will care either...
You're imagining that people who make NLP corpora actually vet the text going into them? I dream of a world where people can be convinced to care that much. I'm not even talking about the scenario you suggest of filtering for proper word usage, I'm talking about filtering at all.
The corpora used for popular word embeddings are full of weird nonsense text (in the case of word2vec) or autogenerated awfulness like spam and the text of porn sites (in the case of fastText) or both (GloVe). And most people who implement ML don't care how their data is collected.
I mean, nobody expects the engineers to manually read through everything, but if the quality of the input text matters for the quality of the autocorrect (or whatever other application you're building with machine learning), you kind of have to make sure the input is pretty good... You could, for example, choose datasets that are expected to contain mostly correct grammar and spelling (Wikipedia, books, etc.) rather than datasets that are expected to contain mostly incorrect grammar and spelling.
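Even a dumb dictionary-coverage filter would catch a lot of the garbage. Something like this rough sketch, say (the word-list path, file names, and the 0.9 threshold are just placeholders I picked, nothing any corpus project actually uses):

    import re

    # Placeholder lexicon; /usr/share/dict/words exists on many Unix systems,
    # but swap in whatever word list you trust.
    with open("/usr/share/dict/words") as f:
        known = {w.strip().lower() for w in f}

    def mostly_clean(line, threshold=0.9):
        # Keep a line only if most of its alphabetic tokens are dictionary words.
        tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", line)]
        if not tokens:
            return False
        hits = sum(t in known for t in tokens)
        return hits / len(tokens) >= threshold

    # Hypothetical input/output file names for illustration.
    with open("raw_corpus.txt") as src, open("filtered_corpus.txt", "w") as dst:
        for line in src:
            if mostly_clean(line):
                dst.write(line)

It's crude: it would wrongly drop lines full of proper nouns and wrongly keep fluent spam. But it's the "filtering at all" the parent is talking about.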
Or don't use a machine learning model. I honestly don't care, just don't automatically turn a correct "its" into an incorrect "it's".
Wouldn't later editions of books, with only corrected text, be better? Proofread, edited, proofread, edited, and so on... Google has millions of them they've assumed copyright of. Surely there's enough text there. Do they really just use random website text? Nearly every news story I read has errors, and those outlets have style guides, trained writers, editors, etc.
Do publishers sell their published text in bulk for use in AI/ML? Like 1,000 books, no images or frontispieces, etc., possibly jumbled by sentence/paragraph/page.
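To be concrete about what I mean by "jumbled", something like this toy sketch, assuming the books arrive as plain-text files (the naive sentence splitting is obviously not what a real pipeline would ship):

    import random

    def jumble(book_paths, seed=0):
        # Split each book into rough sentences, pool them, and shuffle so the
        # original order (and most of the narrative) is unrecoverable, while
        # the word-level statistics stay intact.
        sentences = []
        for path in book_paths:
            with open(path) as f:
                text = f.read()
            rough = text.replace("?", ".").replace("!", ".").split(".")
            sentences.extend(s.strip() for s in rough if s.strip())
        random.Random(seed).shuffle(sentences)
        return sentences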
They say they welcome contributions; I don't know if they just mean new sources of text, or if this includes code for filtering or fixing their existing ones.