I’m sure some lawyers figured it out (fair use maybe?) but they basically used scraped content from Common Crawl. Trying to figure out how to license that would be harder than trying to get all the land rights for a railroad across the US. So... probably not purchased.
This is an excellent question that applies to a lot of machine learning datasets. AFAIK there is no specific licensing for much of it. The licenses you see are generally attached to the labels; the images/text/etc. are often not even part of the download, and you need to go scrape them yourself. It's often claimed that the results of the network are free from the copyright of the training data used to create it, but this is contentious and has definitely not been tested in court.
I wonder if that's one reason they can't open it up. Perhaps they are ensuring that responses that come from the AI are sufficiently different from anything in the training corpus, so it can't be queried for sensitive data or large chunks of copyrighted material.
That's still coming via the API. They may be blocking any large chunks of text from being reproduced, rather than individual characters (which would be harder to police).
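To make the "large chunks" idea concrete, here's a rough sketch of what a filter like that could look like. This is purely a guess at the general shape, not anything that has been described for the actual API: the function names, the word-level comparison, and the 8-word threshold are all made up for illustration, and a real system would need an index over the corpus rather than a pairwise scan.

```python
def longest_shared_run(output: str, corpus_doc: str) -> int:
    """Length, in words, of the longest run of consecutive words
    that appears in both texts (longest common substring over words)."""
    a = output.split()
    b = corpus_doc.split()
    best = 0
    # prev[j] = length of the shared run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def reproduces_large_chunk(output: str, corpus_docs: list[str], max_words: int = 8) -> bool:
    """Hypothetical check: flag an output if it shares a verbatim run of
    at least `max_words` words with any document in the corpus."""
    return any(longest_shared_run(output, doc) >= max_words for doc in corpus_docs)


corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
copied = "he said the quick brown fox jumps over the lazy dog today"
print(reproduces_large_chunk(copied, corpus))             # long verbatim run
print(reproduces_large_chunk("a fox and a dog", corpus))  # only single-word overlap
```

Working at the level of word runs rather than characters matches the point above: short overlaps are unavoidable in any natural text, so only long verbatim runs are worth flagging.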
Yes, 500 billion words, for a model with 175 billion parameters.