If the implementation is semantically rich and complete enough, this might really help those who want to tackle the first of pg's "ambitious startup" ideas.
If I have an idea for a search product, competing with Google isn't really the first roadblock my brain puts up. It's more like "sure brain, sounds swell; now, how do you propose to populate this engine of yours?"
I imagine that building the index is a computationally comparable problem to crawling the websites themselves. Does anyone have any data on whether this is actually a large win?
I too would like to know the answer to this. In my experience building indexes and crawlers the crawler is always easier to write.
The reason being that, initially, you just need a lot of pages to work with. Anyone can write a simple

    while (links) { get link }

crawler and just let it run for months on end without too many issues. Heck, just some xargs and wget will get you by for a long time.
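For what it's worth, that loop amounts to something like the following in Python. This is a rough sketch, not a production crawler: it assumes the requests and beautifulsoup4 libraries, and it ignores robots.txt, politeness delays, and anything smarter than exact-URL deduplication.

    import collections
    import urllib.parse

    import requests                # assumes the requests library is installed
    from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

    def crawl(seed_urls, max_pages=1000):
        """Breadth-first crawl: pop a URL, fetch it, enqueue any links found."""
        frontier = collections.deque(seed_urls)
        seen = set(seed_urls)
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue           # skip dead links and timeouts
            fetched += 1
            yield url, html
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)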
By the time you have your search index built and spitting out results, you are going to need your own dedicated crawler anyway, to make sure you are crawling the pages you have identified as most interesting.
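That "dedicated crawler" mostly comes down to swapping the plain queue above for a priority queue, so the frontier always pops the pages your own ranking says matter most. A minimal sketch (how you compute the score, whether from inlink counts, domain whitelists, or whatever else, is your call; nothing here is prescribed by the parent comment):

    import heapq

    class Frontier:
        """Keeps the URLs you scored as most interesting at the front of the queue."""
        def __init__(self):
            self._heap = []
            self._seen = set()

        def add(self, url, score):
            # heapq is a min-heap, so negate the score to pop the highest-scored URL first
            if url not in self._seen:
                self._seen.add(url)
                heapq.heappush(self._heap, (-score, url))

        def pop(self):
            return heapq.heappop(self._heap)[1]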
I imagine this data set is not so useful for those building a search engine as for those wanting to calculate statistics on snapshots of the web, such as the share of pages using jQuery and the like.