
Stemmers tend to cover only English and a handful of European languages. For example, you can see what Snowball covers here [1], while NLTK has some more [2].

[1] https://snowballstem.org/

[2] https://www.nltk.org/api/nltk.stem.html

What you may find useful is Unidecode, which transliterates Unicode text into similar-sounding ASCII. There are packages available for most programming languages, with the original being in Perl. I highly recommend reading the original article describing how it works. My practical experience is that it is fairly good for text searching and produces reasonable results.

https://interglacial.com/tpj/22/
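
To give a feel for the idea, here is a minimal sketch in Python. It does not use Unidecode itself; it assumes the standard library's unicodedata module as a rough stand-in, and fold_to_ascii is a made-up name. It only strips diacritics from Latin letters. Characters with no decomposition (for example "ø", or anything in a non-Latin script) are simply dropped here, whereas Unidecode aims to map them to an ASCII approximation.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip diacritics; NOT Unidecode, just a stdlib approximation."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the combining marks and any other
    # character that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("Žluťoučký kůň"))  # Zlutoucky kun
print(fold_to_ascii("crème brûlée"))   # creme brulee
```

If your one language uses a Latin alphabet with accents, something this small may already get you most of the way; for other scripts you would want the real thing.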



This!

Just pass both the content and the query through unidecode and a simple filter that removes stop words and cruft from the content; it works wonders for autocomplete. Applying that idea to true full-text search is somewhat more involved (you have to somehow produce a stemmer that works on the output of unidecode), but it's doable and it works.
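
Roughly what that looks like for autocomplete, as a sketch rather than what I actually run: the stop word list, the function names, and the sample titles are all invented for illustration, and the ASCII folding uses the standard library's unicodedata instead of unidecode so the snippet runs without extra packages. Matching is plain prefix matching on normalized tokens.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def normalize(text):
    """Fold to ASCII, lowercase, tokenize, and drop stop words."""
    folded = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    tokens = re.findall(r"[a-z0-9]+", folded.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def autocomplete(query, titles):
    """Return titles where every query token is a prefix of some title token."""
    query_tokens = normalize(query)
    if not query_tokens:
        return []
    hits = []
    for title in titles:
        words = normalize(title)
        if all(any(w.startswith(t) for w in words) for t in query_tokens):
            hits.append(title)
    return hits


titles = ["Crème brûlée", "The Creme Egg", "Café au lait"]
print(autocomplete("creme bru", titles))  # ['Crème brûlée']
print(autocomplete("the cafe", titles))   # ['Café au lait']
```

The point is that content and query go through the exact same normalize step, so it does not matter whether the user types the accents or not.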


I really only have to worry about one language. I think I've seen this method before, but I'll read up on it. Thanks for the link.



