Can anyone list some good resources about scraping, with gotchas etc.?

Jake232 · on Jan 19, 2015

I wrote an article on scraping last year which received a lot of praise. May be worth a read - http://jakeaustwick.me/python-web-scraping-resource/

forlorn · on Jan 19, 2015

My recipe is to use Typhoeus (https://github.com/typhoeus/typhoeus) + Nokogiri. I have tried lots of different options, including EventMachine with em-http-request and a reactor loop, and concurrent-ruby (both are very poorly documented).

Typhoeus has a built-in concurrency mechanism: callbacks plus a configurable limit on the number of concurrent HTTP requests. You just create a hydra object, then create the first request object with a URL and a callback (you have to check for errors like 404 yourself). In the callback you extract other URLs from the page and push them to the hydra again, with the same or another callback.
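
Roughly, that recipe could look like the sketch below. It is not a finished crawler. The helper names `same_host_links` and `crawl`, the `max_pages` cap, and the choice to stay on the starting host are my own assumptions for illustration. The gems are required inside `crawl` only so the pure link helper can be tried without them installed.

```ruby
require "set"
require "uri"

# Hypothetical helper: resolve hrefs against the page URL and keep only
# http(s) links on the same host, with fragments removed.
def same_host_links(base_url, hrefs)
  base_host = URI.parse(base_url).host
  hrefs.filter_map do |href|
    begin
      uri = URI.join(base_url, href.to_s.strip)
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
    next unless uri.is_a?(URI::HTTP) && uri.host == base_host
    uri.fragment = nil
    uri.to_s
  end.uniq
end

# Hypothetical crawl loop: one hydra, one callback reused for every page.
def crawl(start_url, max_pages: 20, concurrency: 5)
  require "typhoeus" # gem install typhoeus
  require "nokogiri" # gem install nokogiri

  hydra = Typhoeus::Hydra.new(max_concurrency: concurrency)
  seen  = Set.new
  pages = {}

  enqueue = lambda do |url|
    # Stop at the page cap, and never queue the same URL twice.
    next if seen.size >= max_pages || !seen.add?(url)

    request = Typhoeus::Request.new(url, followlocation: true)
    request.on_complete do |response|
      if response.success?
        pages[url] = response.body
        hrefs = Nokogiri::HTML(response.body).css("a[href]").map { |a| a["href"] }
        # Push newly found URLs back onto the same hydra.
        same_host_links(url, hrefs).each { |link| enqueue.call(link) }
      else
        # Typhoeus does not raise on 404s or timeouts, so check here.
        warn "#{url}: HTTP #{response.code} (#{response.return_message})"
      end
    end
    hydra.queue(request)
  end

  enqueue.call(start_url)
  hydra.run # blocks until the queue, including requests added in callbacks, is empty
  pages
end
```

Calling `crawl("https://example.com/")` would return a hash of URL to HTML body for up to `max_pages` pages. The `seen` set is what keeps the hydra from looping on pages that link to each other.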

joshmn · on Jan 19, 2015

Just said this myself. I love Typhoeus, though I can't spell it 9/10 times.

llamataboot · on Jan 19, 2015

I really like the scraping chapter in the Bastard's Book of Ruby: http://ruby.bastardsbook.com/chapters/web-scraping/