With custom headers, you can get past the bot protection on a lot of sites and load them anyway (even big ones like YouTube, where I've had success)
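A rough sketch of what I mean, using Playwright's Python API. The user agent and header values here are just illustrative placeholders, and `fetch_title` is a hypothetical helper name; tune the headers per target site.

```python
# Illustrative header values; adjust for the site you're scraping.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)
EXTRA_HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

def fetch_title(url: str) -> str:
    # Import lazily so the constants above work even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # new_context() accepts both a user agent and extra HTTP headers,
        # so every request from this context carries them.
        context = browser.new_context(
            user_agent=USER_AGENT,
            extra_http_headers=EXTRA_HEADERS,
        )
        page = context.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

No guarantee this gets you past every bot check, but a realistic user agent plus Accept-Language/Referer headers is usually the first thing these filters look at.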
Another cool trick is to deny all the content types you don't care about in your Playwright setup.
If you only want text, why bother allowing requests for fonts, CSS, SVGs, images, videos, etc.?
Just request the HTML and cut out all the other stuff.
PS: I also think this has the nice side effect of consuming fewer of the server's resources (which you didn't care about/need anyway), so win-win
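For anyone curious, the blocking above boils down to a single route handler in Playwright. This is a minimal sketch; the set of blocked resource types (and the helper name `block_heavy_resources`) is my own choice, not anything from the thread.

```python
# Resource types we never need when we only want the HTML/text.
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}

def block_heavy_resources(page):
    """Install a route on a Playwright page that aborts unwanted requests."""
    def handle(route):
        # Playwright tags every request with a resource_type
        # ("document", "image", "font", "stylesheet", "media", ...).
        if route.request.resource_type in BLOCKED_TYPES:
            route.abort()       # drop the request entirely
        else:
            route.continue_()   # let documents, scripts, XHR, etc. through

    # "**/*" matches every URL the page requests.
    page.route("**/*", handle)
```

Call `block_heavy_resources(page)` right after creating the page and before `page.goto(...)`, so the routes are in place for the first navigation.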
I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew Playwright was a thing.
I am looking to refactor a lot of this, and switching over to Playwright is a high priority, ideally using something like Camoufox for scraping instead of plain Chromium.
Most of my work on this over the past month has been simple additions that are nice-to-haves.
I was in a similar boat with my scrapers. Started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. Spent a month or so swapping between the two, which was well worth it. Cleaner API, async support.