
And you can figure out what people have read using the CSS history hack (soon to be closed by Mozilla). You can easily index every story on HN by pinging and scraping newest every 20 or so minutes. Don't worry, they won't block you ;)
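For anyone who wants to try the polling approach, here is a rough sketch of it, not a tested scraper. It assumes story links are anchors carrying a `storylink` CSS class, which is a guess about the markup and should be checked against the actual page. The names `StoryLinkParser`, `poll_once`, and `run_forever` are made up for illustration.

```python
import time
import urllib.request
from html.parser import HTMLParser

NEWEST_URL = "https://news.ycombinator.com/newest"
POLL_SECONDS = 20 * 60  # "every 20 or so minutes"


class StoryLinkParser(HTMLParser):
    """Collects (url, title) pairs from anchors with a given CSS class.

    The default class name is an assumption about the page markup.
    """

    def __init__(self, link_class="storylink"):
        super().__init__()
        self.link_class = link_class
        self.stories = []
        self._href = None   # href of the story anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.link_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.stories.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_stories(html):
    parser = StoryLinkParser()
    parser.feed(html)
    parser.close()
    return parser.stories


def fetch(url=NEWEST_URL):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def poll_once(seen, fetch_page=fetch):
    """Fetch one page and return only the stories not seen before."""
    new = []
    for url, title in parse_stories(fetch_page()):
        if url not in seen:
            seen.add(url)
            new.append((url, title))
    return new


def run_forever():
    seen = set()
    while True:
        for url, title in poll_once(seen):
            print(title, url)
        time.sleep(POLL_SECONDS)
```

Calling `run_forever()` starts the loop. The `seen` set lives only in memory here, so a real indexer would persist it (and the stories) to a database between runs.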


The problem when indexing isn't getting the new content; it is getting the ~1,240,000 older posts that aren't on newest anymore.


You can grab that from searchyc.com, or just ignore it. You can get reasonable recommendations based on the latest articles. This would not be a general-purpose recommendation engine, so you would actually need much less data. Attempting to analyze 1 million+ articles seems like overkill in this case. You could just scrape for a week. If you are building this, I actually have several months' worth of articles indexed and would be happy to provide a DB dump of 80,000+ articles.

I'm actually thinking about hacking this up this weekend as well.
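To make the "latest articles only" idea concrete, here is a toy sketch that ranks recent titles by word overlap with titles a user has already read. This is a stand-in technique picked for brevity, not what anyone in this thread has built, and the function names and sample titles are invented for the example.

```python
import re
from collections import Counter


def tokens(title):
    """Lowercased set of alphanumeric words in a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def recommend(read_titles, candidate_titles, top_n=3):
    # Build a word-frequency profile from what the user has read.
    profile = Counter()
    for title in read_titles:
        profile.update(tokens(title))

    scored = []
    for title in candidate_titles:
        words = tokens(title)
        if not words:
            continue
        # Average profile weight per word, so long titles get no free boost.
        score = sum(profile[w] for w in words) / len(words)
        if score > 0:
            scored.append((score, title))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]


read = ["Show HN: A Lisp interpreter in Python", "Why Lisp macros matter"]
recent = ["Lisp for the web", "Startup funding news", "Python packaging tips"]
print(recommend(read, recent))
```

There is no stop-word filtering here, so common words like "the" or "in" can skew scores. With only a week of scraped titles that is cheap to fix with a small stop list.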



