And you can figure out what people have read using the (soon to be closed by Moz...

_tggb · on April 3, 2010

The problem when indexing isn't getting the new content; it is getting the ~1,240,000 older posts that aren't on newest anymore.

kwamenum86 · on April 3, 2010

You can grab that from searchyc.com, or just ignore it. You can get reasonable recommendations based on the latest articles. This would not be a general-purpose recommendation engine, so you would actually need much less data. Attempting to analyze 1 million+ articles seems like overkill in this case. You could just scrape for a week. If you are building this, I actually have several months' worth of articles indexed and would be happy to provide the db dump of 80,000+ articles.

I'm actually thinking about hacking this up this weekend as well.
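
For anyone wanting to try the "recent articles only" idea, here is a rough sketch of how it might look. It assumes you already have a list of titles a user has read and a list of recently scraped titles, and it ranks the recent ones by word overlap (Jaccard similarity) with what was read. The function names, the similarity measure, and the sample titles are all made up for illustration; nothing here reflects how searchyc.com or any real dump is structured.

```python
import re

def tokens(title):
    """Lowercase word set for a title, ignoring words of two letters or fewer."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) > 2}

def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def recommend(read_titles, recent_titles, top_n=3):
    """Rank recent titles by their best similarity to anything already read."""
    read_sets = [tokens(t) for t in read_titles]
    scored = []
    for title in recent_titles:
        if title in read_titles:
            continue  # skip articles the user has already seen
        score = max((jaccard(tokens(title), r) for r in read_sets), default=0.0)
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]

# Hypothetical sample data, not real articles.
read = ["Scaling a Python web crawler", "Why Lisp macros matter"]
recent = [
    "Writing a web crawler in Go",
    "Ask HN: best standing desk?",
    "Lisp macros for beginners",
    "Scaling a Python web crawler",
]
print(recommend(read, recent))
```

Title overlap is crude, but it only needs a week or so of scraped titles, which is the point: no million-article index required. Swapping in TF-IDF or vote data would be the obvious next step.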