It's an interesting problem to determine where the limits of a language lie, and what counts as a word and what does not. Corpus studies are not sufficient for that purpose, as you will always end up with a large number of hapaxes. Because language is based on social consensus, the most common-sense approach would be to determine the 'wordiness' of a string by checking how many people consider it to be a word.
We are trying to do something like this with large-scale studies for English and Dutch. Since this is closely related to the problem, I will allow myself to share the links:
http://vocabulary.ugent.be
http://woordentest.ugent.be