Hacker News

> Does that mean GPT-3 was trained on an arbitrary, huge database of text?

Yes, roughly 500 billion tokens, for a model with 175 billion parameters.



So that extends my question: how much of that text is copyrighted rather than in the public domain? And if some is, was it all legally purchased?


I’m sure some lawyers figured it out (fair use maybe?) but they basically used scraped content from Common Crawl. Trying to figure out how to license that would be harder than trying to get all the land rights for a railroad across the US. So... probably not purchased.
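For context on what "scraped content from Common Crawl" means in practice: Common Crawl publishes a public CDX index you can query for captures of a given URL. A hedged sketch of building such a query and parsing one result line; the crawl label and the sample response fields here are illustrative, not taken from this thread:

```python
import json
from urllib.parse import urlencode


def build_cc_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build a lookup URL against the Common Crawl CDX index.

    crawl_id is a crawl label like "CC-MAIN-2020-24" (illustrative).
    The index endpoint returns one JSON object per matching capture.
    """
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})


# A sample JSON line of the shape the index returns (values made up):
sample_line = (
    '{"urlkey": "com,example)/", "timestamp": "20200601000000",'
    ' "filename": "crawl-data/CC-MAIN-2020-24/file.warc.gz",'
    ' "offset": "123", "length": "456"}'
)
record = json.loads(sample_line)
# record["filename"], record["offset"], and record["length"] point at a
# byte range inside a WARC archive holding the actual scraped page.
```

Note that nothing in this lookup path carries any licensing metadata about the underlying pages, which is the commenter's point: the crawl records where content lives, not what you may do with it.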


This is an excellent question, and it applies to a lot of machine learning datasets. AFAIK much of the data has no specific licensing: the licenses you see are generally attached to the labels, and the images/text/etc. are often not even part of the download; you need to go scrape them yourself. It's often claimed that the outputs of the network are free from the copyright of the training data used to create it, but this is contentious and has definitely not been tested in court.
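The "labels shipped, content scraped separately" pattern described above looks roughly like this in practice. A minimal sketch; the TSV layout and field names are assumptions for illustration, not any particular dataset's real format:

```python
import csv
import io

# A hypothetical URL-list dataset: what you actually download is just
# pointers plus labels; the media itself must be fetched from each URL.
sample_tsv = (
    "url\tlabel\n"
    "http://example.com/a.jpg\tcat\n"
    "http://example.com/b.jpg\tdog\n"
)


def parse_url_list(tsv_text: str) -> list:
    """Parse a (url, label) manifest into a list of dicts.

    Fetching the content itself is left to the user, which is why the
    dataset's license can cover the labels without ever speaking to the
    copyright status of the media the URLs point at.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


records = parse_url_list(sample_tsv)
# Each record still needs a separate HTTP fetch (not done here) to
# obtain the actual image or text the label describes.
```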


I wonder if that's one reason they can't open it up. Perhaps they are ensuring that responses that come from the AI are sufficiently different from anything in the training corpus, so it can't be queried for sensitive data or large chunks of copyrighted material.


It's basically already open through AI Dungeon. I've had all sorts of conversations so far with copyrighted characters.


That's still coming via the API. They may be blocking any large chunks of text from being reproduced, rather than individual characters (which would be harder to police).



