Common Crawl (commoncrawl.org)
125 points by namin on March 11, 2012 | 5 comments


If the implementation is semantically rich and complete enough, this might really help those who want to tackle the first of pg's "ambitious startup" ideas.

If I have an idea for a search product, competing with Google isn't really the first roadblock my brain puts up. It's more like "sure brain, sounds swell; now, how do you propose to populate this engine of yours?"



This blog post: http://matpalm.com/blog/2012/01/01/common_crawl_collocations... mentions that Common Crawl's data was last updated in September 2010. Does anyone know if that's still the case?


I imagine that building the index is a computationally comparable problem to crawling the websites themselves. Does anyone have any data on whether this is actually a large win?


I too would like to know the answer to this. In my experience building indexes and crawlers, the crawler is always easier to write.

The reason being that, initially, you just need a lot of pages to work with. Anyone can write a simple

while(links) { get link }

crawler and just let it run for months on end without too many issues. Heck, just some xargs and wget will get you by for a long time.
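For concreteness, here's a minimal fleshed-out version of that loop in Python (my own sketch, nothing from Common Crawl itself; the regex link extraction and missing robots.txt/rate-limit handling are deliberate simplifications):

    import re
    import urllib.request
    from collections import deque

    def crawl(seed, limit=100):
        # Breadth-first crawl: pop a URL, fetch it, enqueue unseen links.
        seen = {seed}
        queue = deque([seed])
        while queue:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # dead links are common; skip and move on
            # Naive href extraction -- fine for a sketch, not for production.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen and len(seen) < limit:
                    seen.add(link)
                    queue.append(link)
            yield url, html

    if __name__ == "__main__":
        for url, _ in crawl("https://example.com", limit=10):
            print(url)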

By the time you have your search index built and spitting out results, you are going to need your own dedicated crawler anyway to ensure you are crawling the pages you have identified as most interesting.
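For contrast, the heart of the indexing side is an inverted index mapping each token to the documents that contain it. A toy version (again my own illustration; real engines add tokenization, ranking such as TF-IDF or BM25, and compressed posting lists, which is where the hard work lives):

    from collections import defaultdict

    def build_index(docs):
        # Map token -> set of ids of documents containing that token.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        return index

    docs = {1: "common crawl data", 2: "crawl the web", 3: "web search index"}
    index = build_index(docs)
    print(sorted(index["crawl"]))  # [1, 2]
    print(sorted(index["web"]))    # [2, 3]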

I imagine this data set is not so useful for those building a search engine, but rather for those wanting to calculate statistics over snapshots of the web, such as how many pages use jQuery and the like.
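As an illustration of that kind of measurement, the core computation is just a scan and a counter. A sketch in Python, with the page iteration left hypothetical (the real corpus is distributed as ARC/WARC records on S3 and needs a proper reader; only the counting logic is the point):

    import re

    # Matches <script ... src="...jquery...">; case-insensitive and deliberately loose.
    JQUERY_RE = re.compile(r'<script[^>]+src="[^"]*jquery[^"]*"', re.IGNORECASE)

    def jquery_share(html_pages):
        # Fraction of pages whose HTML pulls in a jQuery script.
        total = hits = 0
        for html in html_pages:
            total += 1
            if JQUERY_RE.search(html):
                hits += 1
        return hits / total if total else 0.0

    # Stand-in pages; in practice this iterable would come from the crawl data.
    pages = [
        '<script type="text/javascript" src="/js/jquery-1.7.min.js"></script>',
        "<p>no scripts here</p>",
    ]
    print("%.0f%% of pages load jQuery" % (100 * jquery_share(pages)))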




