
This is more of a beginner's guide than a master class. This method will not extract most content on modern websites because of the way JavaScript behaves on them. It is also vertically, not horizontally, scalable. There are many other reasons that this is step one when web scraping.


It's part of a series of blog posts that talks explicitly about crawling. There are other posts in the series that do a better job of explaining advanced extraction techniques.

Extraction => https://www.zenrows.com/blog/mastering-web-scraping-in-pytho...

Avoid blocking => https://www.zenrows.com/blog/stealth-web-scraping-in-python-...


ok, but do you offer custom scraping services if I needed to hire someone to build it?



thank you


I worked on a large web scraper for several years and JavaScript almost never needs to be executed. The only times I've had to were to extract obfuscated links that are revealed by some bit-twiddling code specific to each request, and this was achievable by forking out to Deno.


I think JavaScript comes up because Cloudflare uses some kind of JavaScript challenge as part of its DDoS protection. There are Python libraries that know how to deal with it, or you can use some level of headless browser. https://github.com/VeNoMouS/cloudscraper
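Before reaching for a headless browser, it can help to first detect whether you've actually hit a challenge page. A minimal stdlib sketch; the status codes and string markers here are assumptions based on commonly reported Cloudflare challenge responses, and Cloudflare changes them over time:

```python
def looks_like_cf_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare JS challenge page.

    Markers are assumptions based on commonly observed challenge
    responses; they are not a stable, documented interface.
    """
    if status_code not in (403, 503):
        return False
    markers = ("Just a moment...", "cf-browser-verification", "jschl", "_cf_chl")
    return any(m in body for m in markers)

# Stand-in body resembling a challenge page:
sample = '<title>Just a moment...</title><form id="challenge-form">'
print(looks_like_cf_challenge(503, sample))   # True
print(looks_like_cf_challenge(200, "<html>ok</html>"))  # False
```

If this returns True, hand the request off to cloudscraper or a headless browser; otherwise plain HTTP is enough.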


This is highly domain-dependent (and sometimes User-Agent-dependent), and in my experience JS is required more and more often.

e.g. good luck trying to get much out of youtube.com (or any other video site) without executing JS.


YouTube has "var ytInitialData" & "var ytInitialPlayerResponse" params hardcoded in HTML. No need to run JS!
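Pulling those embedded variables out can be done with a regex plus json.loads. A minimal sketch: the variable names match the comment above, but the HTML payload here is a made-up stand-in, and the naive `};` regex would need proper brace matching for real pages (where `};` can appear inside string values):

```python
import json
import re

# Stand-in for a fetched watch-page HTML (real pages are far larger).
html = """
<script>var ytInitialData = {"contents": {"title": "Example video"}};</script>
<script>var ytInitialPlayerResponse = {"videoDetails": {"videoId": "abc123"}};</script>
"""

def extract_var(page: str, name: str) -> dict:
    # Grab the JSON object assigned to `var <name> = {...};` in the source.
    # Fragile on purpose: fine for a sketch, not for production.
    m = re.search(rf"var {name}\s*=\s*(\{{.*?\}});", page, re.DOTALL)
    if not m:
        raise ValueError(f"{name} not found")
    return json.loads(m.group(1))

player = extract_var(html, "ytInitialPlayerResponse")
print(player["videoDetails"]["videoId"])  # abc123
```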


This is something I find a lot of web scraping tools miss. Are there any you'd recommend that specifically deal with things like async JavaScript content loading, or loading content based on what you click on a page (e.g., in Single Page Apps)?


JavaScript content loading is easier in most cases. Just look at your browser's network inspector and grab the URL.

Usually the response is JSON and you can ignore the original page. You might have to auth/grab session cookies first, but that's still easier than working with the HTML.
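The workflow above can be sketched as follows. The endpoint URL and cookie are hypothetical placeholders (you'd copy the real ones from the network inspector), and the request itself is shown in comments so the example runs offline against a stand-in payload:

```python
import json

# In the real workflow you'd hit the XHR endpoint copied from the
# network inspector, sending whatever session cookie it needs, e.g.:
#
#   import urllib.request
#   req = urllib.request.Request(
#       "https://example.com/api/v1/items?page=2",  # hypothetical endpoint
#       headers={"Cookie": "session=...", "Accept": "application/json"},
#   )
#   raw = urllib.request.urlopen(req).read()
#
# Stand-in payload shaped like a typical paginated API response:
raw = b'{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}], "next_page": 3}'

data = json.loads(raw)
titles = [item["title"] for item in data["items"]]
print(titles)             # ['First', 'Second']
print(data["next_page"])  # 3
```

No HTML parsing at all: you read the same structured data the page's own JavaScript consumes.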


Playwright. It can easily be used from JS/TS, Python, Java, .NET, etc. (there's also a community Go port).


Thanks! Is that like using Selenium? (i.e., you have to manage and code the actions yourself)


Yes, quite similar. According to their definition it is a "library to automate Chromium, Firefox and WebKit with a single API."


Thanks! If there are any third-party managed tools to do this, that would be awesome to know about (i.e., where they somehow run common JS functions/site interactions to test for additional content).


Unfortunately, it's a pathological edge case.

Imagine an async-loaded list that continues loading more content as it comes in, until it displays all of the content available to the backend.

When would you know such a list is finished loading?

This sounds insane, but it's pretty easy and common for an ambitious UXer to key in on, and is something I've seen in production pages.

(In the event you are a UXer, please include some sort of status update! Even an overlaid spinner that disappears solves the problem.)
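Absent any loading indicator, one common workaround is to poll the item count and treat the list as finished once it stops changing for a quiet window. A minimal sketch with a simulated loader standing in for the page (the function names and timings are illustrative, not from any particular library):

```python
import time

def wait_until_stable(get_count, quiet_seconds=1.0, timeout=10.0, poll=0.1):
    """Poll get_count() until its value stops changing for `quiet_seconds`.

    Only a heuristic: a backend that pauses longer than the quiet window
    will be mistaken for "done", which is exactly the pathology above.
    """
    deadline = time.monotonic() + timeout
    last = get_count()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll)
        current = get_count()
        if current != last:
            last, last_change = current, time.monotonic()
        elif time.monotonic() - last_change >= quiet_seconds:
            return current
    raise TimeoutError("list never settled")

# Simulated async list that grows by 10 items every 0.1s, up to 50:
start = time.monotonic()
def fake_count():
    return min(50, int((time.monotonic() - start) / 0.1) * 10)

print(wait_until_stable(fake_count, quiet_seconds=0.5, timeout=5.0))  # 50
```

In a real scraper, `get_count` would be something like counting list elements via your headless browser's DOM query.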


kinda agree

- session persistence

- dealing with cdns

- dealing with regional proxies

- dealing with captchas

- dealing with websocket data

- dealing with custom session handshake sequences

list goes on and on and on, but probably just edge cases haha


