Hacker News | jpyles's comments

With custom headers, you can trick a lot of sites with bot protection into letting you load their pages (even big sites like YouTube, where I've had success).
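A minimal sketch of that trick in Playwright's Python API, assuming the goal is simply to present browser-like request headers. The header values and user-agent string here are illustrative, not a guaranteed bypass for any particular site's bot protection:

```python
# Present browser-like headers so basic bot checks treat the
# scraper as a normal desktop-Chrome visitor.

def browser_like_headers(user_agent: str) -> dict:
    """Build a header set resembling a real desktop Chrome request."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,"
                  "application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Upgrade-Insecure-Requests": "1",
    }

def fetch_html(url: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
          "AppleWebKit/537.36 (KHTML, like Gecko) "
          "Chrome/124.0 Safari/537.36")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent=ua)
        # Apply the extra headers to every request in this context.
        context.set_extra_http_headers(browser_like_headers(ua))
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```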


How do you work around pop-ups for newsletters and such? Look at the BBC for a good example.


Pack ad blockers into your containers. They can be loaded into Chrome and help immensely in suppressing popovers while crawling.
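One way to wire that up, sketched with Playwright's Python API: Chromium extensions (such as an unpacked uBlock Origin build) can only be loaded through a persistent context with the extension flags below. The extension path and user-data directory are placeholders you'd supply yourself:

```python
# Load an unpacked ad-blocker extension into Playwright's Chromium.
# Extensions require a persistent context; ext_path is assumed to
# point at an unpacked extension directory baked into the container.

def extension_args(ext_path: str) -> list:
    """Chromium flags that load a single unpacked extension."""
    return [
        f"--disable-extensions-except={ext_path}",
        f"--load-extension={ext_path}",
    ]

def crawl_with_adblock(url: str, ext_path: str, user_data_dir: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Extensions do not work with the plain headless launch;
        # a persistent context in headed mode is the supported path.
        context = p.chromium.launch_persistent_context(
            user_data_dir,
            headless=False,
            args=extension_args(ext_path),
        )
        page = context.new_page()
        page.goto(url)
        html = page.content()
        context.close()
    return html
```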


Thank you, I'll experiment with that. Tips and advice welcome!


Another cool trick is to deny all the content types you don't care about in Playwright. If you only want text, why allow requests for fonts, CSS, SVGs, images, videos, etc.?

Just request the HTML and block everything else.

PS: I think this also has the nice side effect of consuming fewer resources from the server (resources you didn't need anyway), so it's a win-win.
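The denial trick above can be sketched with Playwright's request interception; which resource types to abort is a choice, and the deny-list here is one plausible set for text-only scraping:

```python
# Abort requests for resource types we don't need so only the HTML
# and essential scripts come through.

BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}

def should_block(resource_type: str) -> bool:
    """Decide whether a request should be aborted based on its type."""
    return resource_type in BLOCKED_TYPES

def fetch_text_only(url: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Intercept every request; abort anything in the deny-list,
        # let the rest proceed normally.
        page.route(
            "**/*",
            lambda route: route.abort()
            if should_block(route.request.resource_type)
            else route.continue_(),
        )
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```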


That is a great tip, thank you!


I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew that Playwright was a thing.

I am looking to refactor a lot of this, and switching over to Playwright is a high priority, using something like Camoufox for scraping instead of plain Chromium.

Most of my work on this over the past month has been simple nice-to-have additions.


I was in a similar boat with my scrapers. I started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. I spent a month or so swapping the two, which was well worth it: cleaner API, async support.


Playwright is miles ahead of Selenium, but what I think is really overlooked is chromedp.


Luckily, I have some experience with Playwright, so swapping shouldn't take me too long.

Currently working on a PR to swap over.


