
Can anyone list some good resources about scraping, with gotchas etc.?


I wrote an article on scraping last year which received a lot of praise. May be worth a read - http://jakeaustwick.me/python-web-scraping-resource/


My recipe is Typhoeus (https://github.com/typhoeus/typhoeus) + Nokogiri. I have tried lots of other options, including EventMachine with em-http-request and its reactor loop, and concurrent-ruby (both are very poorly documented).

Typhoeus has a built-in concurrency mechanism with callbacks and a configurable number of concurrent HTTP requests. You create a hydra object, then create the first request object with a URL and a callback (you have to check for errors like 404 yourself). In the callback you extract more URLs from the page and push them onto the hydra, each with the same callback or a different one.


Just said this myself. I love Typhoeus, though I can't spell it 9/10 times.


I really like the scraping chapter in the Bastard's Book of Ruby: http://ruby.bastardsbook.com/chapters/web-scraping/



