I too run a crawler that visits a lot of pages, although not at a particularly high frequency. We visit hundreds of sites, and each site has a custom bot that essentially has two methods: find_links and extract. The first finds more links to visit on the site (e.g. navigates and follows pagination), whereas the latter finds and stores records. Is this similar to your approach?
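For what it's worth, the per-site bot interface I described looks roughly like this sketch (class and method bodies here are hypothetical, just illustrating the two-method split):

```python
from abc import ABC, abstractmethod

class SiteBot(ABC):
    """One custom bot per site: discovers links, extracts records."""

    @abstractmethod
    def find_links(self, page_html: str, page_url: str) -> list[str]:
        """Return further URLs to visit on this site (e.g. pagination)."""

    @abstractmethod
    def extract(self, page_html: str) -> list[dict]:
        """Return structured records found on the page."""

class ExampleShopBot(SiteBot):
    # Purely illustrative logic; a real bot would parse actual markup.
    def find_links(self, page_html, page_url):
        # Follow a hypothetical "next page" link only.
        return [page_url + "?page=2"] if "next" in page_html else []

    def extract(self, page_html):
        # One record per line containing a hypothetical "item" marker.
        return [{"raw": line} for line in page_html.splitlines() if "item" in line]
```

The nice part of this split is that the generic crawl loop never needs site-specific knowledge; it just feeds pages to whichever bot owns the site.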
Incidentally, at scale I find that the hardest part is the whole orchestration: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.
The discovery process is crawling, I suppose, but only within the same site. The higher-speed process is guaranteed to access only data we want to parse; it does no navigation.
Aside from having the physical capacity for the suite to run 24/7, our main challenge is speed. All data must be parsed, matched against other data in our database, and published with the lowest possible latency.
We have pretty strict validation. Addressing errors in retrospect is preferable to publishing incorrect data.
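The "hold rather than publish" policy amounts to a simple gate in front of the publish step; here's a minimal sketch (field names like "id" and "price" are invented for illustration):

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means publishable."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    price = record.get("price")
    if price is not None and price < 0:
        errors.append("negative price")
    return errors

def publish_or_hold(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (publishable, held-for-review)."""
    published, held = [], []
    for record in records:
        if validate_record(record):
            held.append(record)   # fix in retrospect instead of publishing
        else:
            published.append(record)
    return published, held
```

The held queue then becomes the retrospective-fix workload, which keeps bad data out of the published set at the cost of some manual follow-up.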