> *Does that mean GPT-3 was trained on an arbitrary, huge database of text?* Yes...

bozzcl · on July 23, 2020

So that continues my question: how much of that text would be outside the public domain? If any of it is, was it all legally purchased?

jcims · on July 23, 2020

I’m sure some lawyers figured it out (fair use maybe?) but they basically used scraped content from Common Crawl. Trying to figure out how to license that would be harder than trying to get all the land rights for a railroad across the US. So... probably not purchased.

rcxdude · on July 24, 2020

This is an excellent question, and it applies to a lot of machine learning datasets. AFAIK much of the data has no specific licensing (the licenses you see are generally attached to the labels; the images/text/etc. are often not even part of the download, and you need to go scrape them yourself). It's often claimed that the results of the network are free from the copyright of the training data used to create it, but this is contentious and has definitely not been tested in court.

pontifier · on July 23, 2020

I wonder if that's one reason they can't open it up. Perhaps they are ensuring that responses that come from the AI are sufficiently different from anything in the training corpus, so it can't be queried for sensitive data or large chunks of copyrighted material.

Trasmatta · on July 23, 2020

It's basically already open through AI Dungeon. I've had all sorts of conversations so far with copyrighted characters.

dddbbb · on July 23, 2020

That's still coming via the API. They may be blocking any large chunks of text from being reproduced, rather than individual characters (which would be harder to police).
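
To make "blocking large chunks" concrete, here is a minimal sketch of one way such a filter could work: index every run of `n` consecutive words in a reference corpus, then flag any output that repeats one of those runs verbatim. Nothing in this thread confirms that the API does this, or how. The function names (`word_ngrams`, `build_index`, `reproduces_large_chunk`) and the default `n=8` are hypothetical choices for illustration only.

```python
def word_ngrams(text, n):
    """Return the set of all runs of n consecutive words (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(corpus_docs, n=8):
    """Collect every n-word run that appears anywhere in the corpus."""
    index = set()
    for doc in corpus_docs:
        index |= word_ngrams(doc, n)
    return index


def reproduces_large_chunk(output, index, n=8):
    """True if the output repeats any n-word run from the corpus verbatim.

    n must match the value used to build the index.
    """
    return any(gram in index for gram in word_ngrams(output, n))


# Hypothetical one-document "corpus" for illustration.
corpus = ["It was the best of times, it was the worst of times, it was the age of wisdom"]
index = build_index(corpus, n=5)

copied = reproduces_large_chunk("He said it was the best of times, it was awful", index, n=5)
mention = reproduces_large_chunk("the best of times", index, n=5)
```

A check like this catches long verbatim runs but lets a character name or short phrase through, because anything shorter than `n` words produces no n-grams at all. That matches the point above: chunks are easy to detect mechanically, while individual characters are not.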