What is the licensing for code generated in this way? GPT-3 has memorized hundreds of texts verbatim and can be prompted to regurgitate that text. Has this model only been trained on code that doesn't require attribution as part of the license?
The landing page for it states the below, so hopefully not too much of an issue (though I guess some folks may find a 0.1% risk high).
> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.
If you pulled a marble out of a bag ten times a day, with a 0.1% chance each time that it was red: after a day you'd have a 1% chance of seeing a red marble, the first week you'd have a 6.7% chance, the first month you'd have a 26% chance, and the first working year you'd have a 92.6% chance of having seen at least one red marble.
That's not how fair use works. It doesn't matter how unlikely it is, if Copilot one day decides to "suggest" a significant snippet of ckxd from a GPLed app, you'd better be planning to GPL your project.
Automatic refactoring could be useful for avoiding a lot of dumb legal disputes.
I say dumb because I am, perhaps chauvinistically, assuming that no brilliant algorithmic insights will be transferred via the AI copilot, only work that might have been conceptually hard to a beginner but feels routine to a day laborer.
Then again that assumption suggests there'd be nothing to sue over.
God I wish contracts were encoded semantically rather than as plain text. I just tried to look through Github's terms of service[1]. I'd search for "Github can <verb> with <adjective> code" if I could. Instead I'm giving up.
That looks hard. More politically feasible might be a language I've unfortunately forgotten the name of, ordinary English but with some extra rules designed to eliminate ambiguity -- every noun has to carry a specifier like "some" or "all" or "the" or "a", etc.
Pretty much everything is trained on copyrighted content: machine translation software, TWDNE, DALL-E, and all the GPTs. Software people are bringing this up now because it's their ox being gored. It's the same as when furries got upset about This Fursona Does Not Exist.[1][2]
To expand on your argument, pretty much every person is trained on copyrighted content too. Doesn't make their generated content automatically subject to copyright either.