I too run a crawler that visits a lot of pages, although not at a particularly high frequency. We visit hundreds of sites, and each site has a custom bot that essentially has two methods: find_links and extract. The first finds more links to visit on the site (e.g. navigates and follows pagination), whereas the latter finds and stores records. Is this similar to your approach?
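For what it's worth, the per-site bot interface I described looks roughly like this sketch (class and method bodies here are hypothetical, just illustrating the two-method split):

```python
from abc import ABC, abstractmethod

class SiteBot(ABC):
    """One custom bot per site: discovers links, extracts records."""

    @abstractmethod
    def find_links(self, page_html: str, page_url: str) -> list[str]:
        """Return further URLs to visit on this site (e.g. pagination)."""

    @abstractmethod
    def extract(self, page_html: str) -> list[dict]:
        """Return structured records found on the page."""

class ExampleShopBot(SiteBot):
    # Purely illustrative logic; a real bot would parse actual markup.
    def find_links(self, page_html, page_url):
        # Follow a hypothetical "next page" link only.
        return [page_url + "?page=2"] if "next" in page_html else []

    def extract(self, page_html):
        # One record per line containing a hypothetical "item" marker.
        return [{"raw": line} for line in page_html.splitlines() if "item" in line]
```

The nice part of this split is that the generic crawl loop never needs site-specific knowledge; it just feeds pages to whichever bot owns the site.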
Incidentally, at scale I find that the hardest part is the whole orchestration: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.
The discovery process is crawling, I suppose, but only within the same site. The higher-speed process is guaranteed to access only data we want to parse; it does no navigation.
Aside from having the physical capacity for the suite to run 24/7, our main challenge is speed. All data must be parsed, matched against other data in our database, and published with the lowest possible latency.
We have pretty strict validation. Addressing errors in retrospect is preferable to publishing incorrect data.
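The "hold rather than publish" policy amounts to a simple gate in front of the publish step; here's a minimal sketch (field names like "id" and "price" are invented for illustration):

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means publishable."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    price = record.get("price")
    if price is not None and price < 0:
        errors.append("negative price")
    return errors

def publish_or_hold(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (publishable, held-for-review)."""
    published, held = [], []
    for record in records:
        if validate_record(record):
            held.append(record)   # fix in retrospect instead of publishing
        else:
            published.append(record)
    return published, held
```

The held queue then becomes the retrospective-fix workload, which keeps bad data out of the published set at the cost of some manual follow-up.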