Hacker News

Has it been proven that the major labs are scraping via residential IPs? If so I will absolutely write about that.

I know there are a ton of fly-by-night startups abusing residential IP scraping and I hate it, but if it's Anthropic or OpenAI or Google Gemini that's a story worth telling.

I have written a lot about training data - https://simonwillison.net/tags/training-data/ - including highlighting instances where models attempted to train on ethical sources.

I've also pointed out cases where a model claims to use ethical data but still trains on a scrape of the web that's full of unlicensed content, e.g. https://simonwillison.net/2025/Jun/7/comma/ and https://simonwillison.net/2024/Dec/5/pleias-llms/




