Common Crawl (commoncrawl.org)
125 points by namin on March 11, 2012 | 5 comments


If the implementation is semantically rich and complete enough, this might really help those who want to tackle the first of pg's "ambitious startup" ideas.

If I have an idea for a search product, competing with Google isn't really the first roadblock my brain puts up. It's more like "sure brain, sounds swell; now, how do you propose to populate this engine of yours?"



This blog post: http://matpalm.com/blog/2012/01/01/common_crawl_collocations... mentions that Common Crawl's data was last updated in September 2010. Does anyone know if that's still the case?


I imagine that building the index is a computationally comparable problem to crawling the websites themselves. Does anyone have any data on whether this is actually a large win?


I too would like to know the answer to this. In my experience building indexes and crawlers, the crawler is always easier to write.

The reason being that, initially, you just need a lot of pages to work with. Anyone can write a simple

while(links) { get link }

crawler and just let it run for months on end without too many issues. Heck, just some xargs and wget will get you by for a long time.
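For concreteness, here's a minimal fleshed-out version of that loop in Python (my own sketch, nothing from Common Crawl itself; the regex link extraction and missing robots.txt/rate-limit handling are deliberate simplifications):

    import re
    import urllib.request
    from collections import deque

    def crawl(seed, limit=100):
        # Breadth-first crawl: pop a URL, fetch it, enqueue unseen links.
        seen = {seed}
        queue = deque([seed])
        while queue:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # dead links are common; skip and move on
            # Naive href extraction -- fine for a sketch, not for production.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen and len(seen) < limit:
                    seen.add(link)
                    queue.append(link)
            yield url, html

    if __name__ == "__main__":
        for url, _ in crawl("https://example.com", limit=10):
            print(url)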

By the time you have your search index built and spitting out results, you are going to need your own dedicated crawler anyway to ensure you are crawling the pages you have identified as most interesting.
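For contrast, the heart of the indexing side is an inverted index mapping each token to the documents that contain it. A toy version (again my own illustration; real engines add tokenization, ranking such as TF-IDF or BM25, and compressed posting lists, which is where the hard work lives):

    from collections import defaultdict

    def build_index(docs):
        # Map token -> set of ids of documents containing that token.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        return index

    docs = {1: "common crawl data", 2: "crawl the web", 3: "web search index"}
    index = build_index(docs)
    print(sorted(index["crawl"]))  # [1, 2]
    print(sorted(index["web"]))    # [2, 3]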

I imagine this data set is not so useful for those building a search engine, but rather for those wanting to calculate statistics over snapshots of the web, such as how many pages use jQuery and the like.
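As an illustration of that kind of measurement, the core computation is just a scan and a counter. A sketch in Python, with the page iteration left hypothetical (the real corpus is distributed as ARC/WARC records on S3 and needs a proper reader; only the counting logic is the point):

    import re

    # Matches <script ... src="...jquery...">; case-insensitive and deliberately loose.
    JQUERY_RE = re.compile(r'<script[^>]+src="[^"]*jquery[^"]*"', re.IGNORECASE)

    def jquery_share(html_pages):
        # Fraction of pages whose HTML pulls in a jQuery script.
        total = hits = 0
        for html in html_pages:
            total += 1
            if JQUERY_RE.search(html):
                hits += 1
        return hits / total if total else 0.0

    # Stand-in pages; in practice this iterable would come from the crawl data.
    pages = [
        '<script type="text/javascript" src="/js/jquery-1.7.min.js"></script>',
        "<p>no scripts here</p>",
    ]
    print("%.0f%% of pages load jQuery" % (100 * jquery_share(pages)))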




