
This is more of a beginner's guide than a master class. This method will not extract most content on modern websites because of the way JavaScript behaves on them. It is also vertically, not horizontally, scalable. There are many other reasons that this is step one when web scraping.


It's part of a series of blog posts that talks explicitly about crawling. There are other posts in the series that do a better job of explaining advanced extraction techniques.

Extraction => https://www.zenrows.com/blog/mastering-web-scraping-in-pytho...

Avoid blocking => https://www.zenrows.com/blog/stealth-web-scraping-in-python-...


ok, but do you offer custom scraping services if I needed to hire someone to build it?



thank you


I worked on a large web scraper for several years and JavaScript almost never needs to be executed. The only times I've had to were to extract obfuscated links that are revealed by some bit-twiddling code specific to each request, and this was achievable by forking out to Deno.


I think JavaScript comes up because Cloudflare uses some kind of JavaScript challenge as part of its DDoS protection. There are Python libraries that know how to deal with it, or you can use some level of headless browser. https://github.com/VeNoMouS/cloudscraper
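Before reaching for a headless browser, it can help to first detect whether you've actually hit a challenge page. A minimal stdlib sketch; the status codes and string markers here are assumptions based on commonly reported Cloudflare challenge responses, and Cloudflare changes them over time:

```python
def looks_like_cf_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare JS challenge page.

    Markers are assumptions based on commonly observed challenge
    responses; they are not a stable, documented interface.
    """
    if status_code not in (403, 503):
        return False
    markers = ("Just a moment...", "cf-browser-verification", "jschl", "_cf_chl")
    return any(m in body for m in markers)

# Stand-in body resembling a challenge page:
sample = '<title>Just a moment...</title><form id="challenge-form">'
print(looks_like_cf_challenge(503, sample))   # True
print(looks_like_cf_challenge(200, "<html>ok</html>"))  # False
```

If this returns True, hand the request off to cloudscraper or a headless browser; otherwise plain HTTP is enough.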


This is highly domain-dependent (and sometimes User-Agent-dependent), and in my experience JS is required more and more often.

e.g. good luck trying to get much out of youtube.com (or any other video site) without executing JS.


YouTube has "var ytInitialData" & "var ytInitialPlayerResponse" params hardcoded in HTML. No need to run JS!
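Pulling those embedded variables out can be done with a regex plus json.loads. A minimal sketch: the variable names match the comment above, but the HTML payload here is a made-up stand-in, and the naive `};` regex would need proper brace matching for real pages (where `};` can appear inside string values):

```python
import json
import re

# Stand-in for a fetched watch-page HTML (real pages are far larger).
html = """
<script>var ytInitialData = {"contents": {"title": "Example video"}};</script>
<script>var ytInitialPlayerResponse = {"videoDetails": {"videoId": "abc123"}};</script>
"""

def extract_var(page: str, name: str) -> dict:
    # Grab the JSON object assigned to `var <name> = {...};` in the source.
    # Fragile on purpose: fine for a sketch, not for production.
    m = re.search(rf"var {name}\s*=\s*(\{{.*?\}});", page, re.DOTALL)
    if not m:
        raise ValueError(f"{name} not found")
    return json.loads(m.group(1))

player = extract_var(html, "ytInitialPlayerResponse")
print(player["videoDetails"]["videoId"])  # abc123
```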


This is something I find a lot of web scraping tools miss. Are there any you'd recommend that specifically deal with things like async JavaScript content loading, or loading content based on what you click on a page (e.g., in Single Page Apps)?


JavaScript content loading is easier in most cases. Just look at your browser's network inspector and grab the URL.

Usually the response is JSON and you can ignore the original page. You might have to auth/grab session cookies first, but that's still easier than working with the HTML.
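The workflow above can be sketched as follows. The endpoint URL and cookie are hypothetical placeholders (you'd copy the real ones from the network inspector), and the request itself is shown in comments so the example runs offline against a stand-in payload:

```python
import json

# In the real workflow you'd hit the XHR endpoint copied from the
# network inspector, sending whatever session cookie it needs, e.g.:
#
#   import urllib.request
#   req = urllib.request.Request(
#       "https://example.com/api/v1/items?page=2",  # hypothetical endpoint
#       headers={"Cookie": "session=...", "Accept": "application/json"},
#   )
#   raw = urllib.request.urlopen(req).read()
#
# Stand-in payload shaped like a typical paginated API response:
raw = b'{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}], "next_page": 3}'

data = json.loads(raw)
titles = [item["title"] for item in data["items"]]
print(titles)             # ['First', 'Second']
print(data["next_page"])  # 3
```

No HTML parsing at all: you read the same structured data the page's own JavaScript consumes.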


Playwright. It can easily be used from JS/TS, Python, Java, .NET, etc. (there's also a community Go port).


Thanks! Is that like using Selenium? (i.e., you have to manage and code the actions yourself)


Yes, quite similar. According to their definition it is a "library to automate Chromium, Firefox and WebKit with a single API."


Thanks! If there are any third-party managed tools to do this, that would be awesome to know about (i.e., where they somehow run common JS functions/site interactions to test for additional content).


Unfortunately, it's a pathological edge case.

Imagine an async-loaded list that continues loading more content as it comes in, until it displays all of the content available to the backend.

When would you know such a list is finished loading?

This sounds insane, but it's pretty easy and common for an ambitious UXer to key in on, and is something I've seen in production pages.

(In the event you are a UXer, please include some sort of status update! Even an overlaid spinner that disappears solves the problem.)
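Absent any loading indicator, one common workaround is to poll the item count and treat the list as finished once it stops changing for a quiet window. A minimal sketch with a simulated loader standing in for the page (the function names and timings are illustrative, not from any particular library):

```python
import time

def wait_until_stable(get_count, quiet_seconds=1.0, timeout=10.0, poll=0.1):
    """Poll get_count() until its value stops changing for `quiet_seconds`.

    Only a heuristic: a backend that pauses longer than the quiet window
    will be mistaken for "done", which is exactly the pathology above.
    """
    deadline = time.monotonic() + timeout
    last = get_count()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll)
        current = get_count()
        if current != last:
            last, last_change = current, time.monotonic()
        elif time.monotonic() - last_change >= quiet_seconds:
            return current
    raise TimeoutError("list never settled")

# Simulated async list that grows by 10 items every 0.1s, up to 50:
start = time.monotonic()
def fake_count():
    return min(50, int((time.monotonic() - start) / 0.1) * 10)

print(wait_until_stable(fake_count, quiet_seconds=0.5, timeout=5.0))  # 50
```

In a real scraper, `get_count` would be something like counting list elements via your headless browser's DOM query.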


kinda agree

- session persistence

- dealing with cdns

- dealing with regional proxies

- dealing with captchas

- dealing with websocket data

- dealing with custom session handshake sequences

list goes on and on and on, but probably just edge cases haha


