Sunday 27 May 2012

Finding new words

It should be relatively easy to crawl the World Wide Web, compare its contents against all established dictionaries (all languages, technical glossaries, and slang dictionaries), and so find character strings that don't match any existing word definition. For each such string, it should also be possible to measure its frequency.
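As a rough sketch of the counting step, assuming the crawled pages are already available as plain-text strings and the combined dictionaries fit in a single set: the `KNOWN_WORDS` set, the `count_unknown_words` function, and the sample pages are all made up for illustration, and the tokenizer only handles unaccented Latin letters, which a real multilingual crawl would need to replace.

```python
import re
from collections import Counter

# Hypothetical stand-in for the combined dictionaries; a real list
# would hold millions of entries across languages and glossaries.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "a"}


def count_unknown_words(pages, known_words):
    """Count how often each token missing from known_words appears."""
    counts = Counter()
    for text in pages:
        # Naive tokenizer: lowercase runs of the letters a-z only.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in known_words:
                counts[token] += 1
    return counts


# Made-up sample "pages" standing in for crawled content.
pages = [
    "The cat sat on the mat.",
    "A blorf sat on the blorf mat.",
    "The cat zonked.",
]

unknown = count_unknown_words(pages, KNOWN_WORDS)
print(unknown.most_common())
```

Sorting the counter with `most_common()` puts the most frequent undefined strings first.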

Of course, there will be numerous spelling mistakes (and in fact data on common spelling mistakes is useful in itself: for spelling correction, for understanding the way people type, and for understanding the way people think), but it should be possible to filter these out based on similarity rules, for example by setting aside any string that is only a letter or two away from a known word.
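One possible similarity rule, sketched here with Python's standard `difflib` module: treat an unknown string as a probable misspelling if it closely resembles a known word, and keep it as a candidate new word otherwise. The `split_misspellings` function, the 0.8 cutoff, and the word lists are assumptions for illustration, not a tested recipe; a real filter would need tuning, and comparing against a full dictionary this way would be slow.

```python
import difflib


def split_misspellings(unknown_words, known_words, cutoff=0.8):
    """Separate probable misspellings from candidate new words.

    An unknown word counts as a probable misspelling when difflib
    finds a known word whose similarity ratio is at least `cutoff`.
    """
    likely_misspellings = {}
    candidates = []
    for word in unknown_words:
        matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
        if matches:
            # Record the known word this string most resembles.
            likely_misspellings[word] = matches[0]
        else:
            candidates.append(word)
    return likely_misspellings, candidates


# Made-up example data.
known = ["definitely", "receive", "separate"]
unknown_words = ["definately", "recieve", "blorf"]

misspellings, candidates = split_misspellings(unknown_words, known)
print(misspellings)
print(candidates)
```

The mapping from misspelling to intended word is kept rather than discarded, since that is exactly the kind of data on common mistakes mentioned above.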

Why would you want to do this? To answer the question "what is the most common undefined word?", but also to start adding definitions for those words to the relevant dictionaries, and so increase the extent of structured human knowledge.
