Has it been proven that the major labs are scraping via residential IPs? If so I will absolutely write about that.
I know there are a ton of fly-by-night startups abusing residential IP scraping and I hate it, but if it's Anthropic or OpenAI or Google Gemini that's a story worth telling.
I have written a lot about training data - https://simonwillison.net/tags/training-data/ - including highlighting instances where model developers attempted to train on ethically sourced data.
I've also pointed out when a model claims to use ethical data but still uses a scrape of the web that's full of unlicensed content, e.g. https://simonwillison.net/2025/Jun/7/comma/ and https://simonwillison.net/2024/Dec/5/pleias-llms/