Hacker News | jpyles's comments

With custom headers, you can trick a lot of sites with bot protection into letting you load their pages (even big sites like YouTube, where I've had success).
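A minimal sketch of that trick in Playwright's Python API, assuming the goal is simply to present browser-like request headers. The header values and user-agent string here are illustrative, not a guaranteed bypass for any particular site's bot protection:

```python
# Present browser-like headers so basic bot checks treat the
# scraper as a normal desktop-Chrome visitor.

def browser_like_headers(user_agent: str) -> dict:
    """Build a header set resembling a real desktop Chrome request."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,"
                  "application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Upgrade-Insecure-Requests": "1",
    }

def fetch_html(url: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
          "AppleWebKit/537.36 (KHTML, like Gecko) "
          "Chrome/124.0 Safari/537.36")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent=ua)
        # Apply the extra headers to every request in this context.
        context.set_extra_http_headers(browser_like_headers(ua))
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```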


How do you work around pop-ups for newsletters and such? Look at the BBC for a good example.


Pack ad blockers into your containers. They can be loaded into Chrome and help immensely in suppressing popovers while crawling.
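One way to wire that up, sketched with Playwright's Python API: Chromium extensions (such as an unpacked uBlock Origin build) can only be loaded through a persistent context with the extension flags below. The extension path and user-data directory are placeholders you'd supply yourself:

```python
# Load an unpacked ad-blocker extension into Playwright's Chromium.
# Extensions require a persistent context; ext_path is assumed to
# point at an unpacked extension directory baked into the container.

def extension_args(ext_path: str) -> list:
    """Chromium flags that load a single unpacked extension."""
    return [
        f"--disable-extensions-except={ext_path}",
        f"--load-extension={ext_path}",
    ]

def crawl_with_adblock(url: str, ext_path: str, user_data_dir: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Extensions do not work with the plain headless launch;
        # a persistent context in headed mode is the supported path.
        context = p.chromium.launch_persistent_context(
            user_data_dir,
            headless=False,
            args=extension_args(ext_path),
        )
        page = context.new_page()
        page.goto(url)
        html = page.content()
        context.close()
    return html
```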


Thank you, I'll experiment with that. Tips and advice welcome!


Another cool trick is to deny all the content types you don't care about in Playwright. If you only want text, why allow requests for fonts, CSS, SVGs, images, videos, etc.?

Just request the HTML and block everything else.

PS: I think this also has the nice side effect of consuming fewer resources from the server (resources you didn't need anyway), so it's a win-win.
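The denial trick above can be sketched with Playwright's request interception; which resource types to abort is a choice, and the deny-list here is one plausible set for text-only scraping:

```python
# Abort requests for resource types we don't need so only the HTML
# and essential scripts come through.

BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}

def should_block(resource_type: str) -> bool:
    """Decide whether a request should be aborted based on its type."""
    return resource_type in BLOCKED_TYPES

def fetch_text_only(url: str) -> str:
    # Import inside the function so the helper above stays usable
    # without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Intercept every request; abort anything in the deny-list,
        # let the rest proceed normally.
        page.route(
            "**/*",
            lambda route: route.abort()
            if should_block(route.request.resource_type)
            else route.continue_(),
        )
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```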


That is a great tip, thank you!


I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew that Playwright was a thing.

I am looking to refactor a lot of this, and switching over to Playwright is a high priority, using something like Camoufox for scraping instead of plain Chromium.

Most of my work on this over the past month has been simple nice-to-have additions.


I was in a similar boat with my scrapers. I started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. I spent a month or so swapping the two, which was well worth it: cleaner API, async support.


Playwright is miles ahead of Selenium, but what I think is really overlooked is chromedp.


Luckily, I have some experience with Playwright, so swapping shouldn't take me too long.

Currently working on a PR to swap over.


