
> many AI companies engage in web crawling

Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain not just public content but also paywalled content, including ebooks, forums and more. I doubt these folks are trying to DDoS anyone, though.



How do they manage to get 'paywalled' content?


Maybe 'paywalled' is not the best word, but with their Identity Based Crawling feature and Managed Browsers[1], you can reuse an existing account and scrape content that requires authentication. This may not sound like anything new, but IMHO crawl4ai's workflow is easy to follow.

[1] https://docs.crawl4ai.com/advanced/identity-based-crawling
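For anyone curious what that workflow looks like, here is a rough sketch, not an authoritative recipe. It assumes you have already logged in to the site once in a Chromium profile stored at a directory of your choosing, and that BrowserConfig accepts use_managed_browser and user_data_dir as I recall from the page in [1]; check the docs for your installed version. The helper names managed_browser_settings and crawl_logged_in_page are made up for illustration.

```python
from pathlib import Path


def managed_browser_settings(profile_dir: str, headless: bool = True) -> dict:
    """Build keyword arguments for crawl4ai's BrowserConfig.

    Hypothetical helper: the keys mirror parameter names from the crawl4ai
    docs as recalled here (an assumption, verify against your version).
    """
    return {
        "headless": headless,
        "use_managed_browser": True,  # reuse a persistent browser profile
        "browser_type": "chromium",
        # Profile directory that already holds the logged-in session.
        "user_data_dir": str(Path(profile_dir).expanduser()),
    }


async def crawl_logged_in_page(url: str, profile_dir: str) -> str:
    """Fetch a page that needs authentication, using the saved profile."""
    # Imported lazily so the settings helper works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

    browser_config = BrowserConfig(**managed_browser_settings(profile_dir))
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=CrawlerRunConfig())
        if not result.success:
            raise RuntimeError(result.error_message)
        return str(result.markdown)
```

You would call it with something like asyncio.run(crawl_logged_in_page(url, profile_dir)), where both arguments are your own. The point is that the session lives in the browser profile, so the script itself never handles a password.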



