What is the licensing for code generated in this way? GPT-3 has memorized hundre...

billti · on June 29, 2021

The landing page for it states the below, so hopefully not too much of an issue (though I guess some folks may find a 0.1% risk high).

> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

rictic · on June 30, 2021

If you pulled a marble out of a bag ten times a day, with a 0.1% chance each time that it was red, then after a day you'd have a 1% chance of seeing a red marble, after the first week a 6.8% chance, after the first month a 26% chance, and after the first working year a 92.6% chance of having seen at least one red marble.

Probabilities are fun!
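
For anyone who wants to check the marble arithmetic, here is a minimal sketch. It assumes independent draws, a 7-day week, a 30-day month, and 260 working days in a year; those day counts are my assumptions, not something stated in the comment, and the function and constant names are made up for illustration.

```python
def chance_of_at_least_one(p: float, draws: int) -> float:
    """Probability of at least one hit in `draws` independent tries,
    each succeeding with probability `p`: 1 - (1 - p)^draws."""
    return 1.0 - (1.0 - p) ** draws


P_VERBATIM = 0.001   # 0.1% per suggestion, the figure quoted from the landing page
DRAWS_PER_DAY = 10   # ten marbles a day, as in the comment

# Assumed period lengths in days (not given in the thread).
PERIODS = {"day": 1, "week": 7, "month": 30, "working year": 260}

for name, days in PERIODS.items():
    chance = chance_of_at_least_one(P_VERBATIM, DRAWS_PER_DAY * days)
    print(f"{name:>12}: {chance:.1%}")
```

The point of the sketch is only that a small per-draw probability compounds quickly once the number of draws gets large.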

ganzuul · on June 30, 2021

Well within the margin of fair use.

marcan_42 · on June 30, 2021

That's not how fair use works. It doesn't matter how unlikely it is, if Copilot one day decides to "suggest" a significant snippet of code from a GPLed app, you'd better be planning to GPL your project.

hakre · on July 2, 2021

No, and why that is not the case has been outlined in the past, e.g. https://lwn.net/Articles/61292/ (though that is most likely only one opinion).

On the other hand, it would be interesting to learn what the copyright implications are of

a) creating a utility like Copilot, which is a software program that contains a corpus based on copyrighted material (the database that has been trained)

b) using it to create code based on the corpus, resulting in software that is itself a work under copyright.

geekraver · on June 30, 2021

And you would have a whole lot of blue marbles.

propertymagnate · on July 1, 2021

"I only murdered them once" isn't the best of legal defenses.

Jeff_Brown · on June 30, 2021

Automatic refactoring could be useful for avoiding a lot of dumb legal disputes.

I say dumb because I am, perhaps chauvinistically, assuming that no brilliant algorithmic insights will be transferred via the AI copilot, only work that might have been conceptually hard to a beginner but feels routine to a day laborer.

Then again that assumption suggests there'd be nothing to sue over.

Gh0stRAT · on June 30, 2021

True, but that definitely wouldn't stop Oracle from suing over it anyway. (See the rangeCheck chapter of Oracle v Google [0])

Also, Oracle v Google opens the possibility of a fair-use defense in the event that Copilot does regurgitate a small snippet of code.

[0] https://news.ycombinator.com/item?id=11722514

jgworks · on June 29, 2021

I'd be surprised if a company's legal department would be OK with that 0.1% risk.

viraptor · on June 29, 2021

Google already learned that one. "There's only a tiny chance we may be copying some public code from Oracle." may not be a good explanation there.

TedDoesntTalk · on June 29, 2021

Life wouldn’t be entertaining without the License Nazis. No code for you! (Seinfeld reference)

varispeed · on June 29, 2021

Did they have a license to use public source code as a data source for the data set though?

Jeff_Brown · on June 30, 2021

God I wish contracts were encoded semantically rather than as plain text. I just tried to look through Github's terms of service[1]. I'd search for "Github can <verb> with <adjective> code" if I could. Instead I'm giving up.

[1] https://docs.github.com/en/github/site-policy/github-terms-o...

int_19h · on June 30, 2021

A world in which all laws and contracts were required to be written in Lojban would be interesting.

Jeff_Brown · on June 30, 2021

That looks hard. More politically feasible might be a language I've unfortunately forgotten the name of, ordinary English but with some extra rules designed to eliminate ambiguity -- every noun has to carry a specifier like "some" or "all" or "the" or "a", etc.

ganzuul · on June 30, 2021

Legalese might be similar to code, and there is lots of interest in making law machine readable. So don't give up; check back later.

smackjer · on June 29, 2021

Yes, it's public source code.

haimez · on June 30, 2021

Public doesn’t mean it’s not encumbered by copyrights

ggreer · on June 30, 2021

Pretty much everything is trained on copyrighted content: machine translation software, TWDNE, DALL-E, and all the GPTs. Software people are bringing this up now because it's their ox being gored. It's the same as when furries got upset about This Fursona Does Not Exist.[1][2]

1. https://news.ycombinator.com/item?id=23093911

2. https://www.reddit.com/r/HobbyDrama/comments/gfam2y/furries_...

tremon · on June 30, 2021

To expand on your argument, pretty much every person is trained on copyrighted content too. Doesn't make their generated content automatically subject to copyright either.

malka · on June 30, 2021

Yeah, except that Oracle and Google have way more lawyer power than furry artists.

ganzuul · on June 30, 2021

You have no idea how much money they make. Some of them have payment plans for commissions.

user-the-name · on June 30, 2021

This is an argument for why this is a bigger problem, not a smaller one.

oalae5niMiel7qu · on July 4, 2021

If it's BSD-licensed, the encumbrance doesn't matter much.

iandanforth · on June 29, 2021

Update: Nat Friedman answered this as part of this thread on Twitter:

https://twitter.com/natfriedman/status/1409883713786241032

Basically they are building a system to find explicit copying and warn developers when the output is verbatim.
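
To make "find explicit copying" concrete, here is one way such a check could work in principle. This is a rough sketch under my own assumptions, not a description of what GitHub is actually building (the tweet does not give implementation details): tokenize the code, index every run of `n` consecutive tokens from the training corpus, and flag a suggestion if it shares any such run. All function names and the toy corpus are hypothetical.

```python
import re


def tokens(code: str) -> list[str]:
    # Identifiers, numbers, and single punctuation characters;
    # whitespace is dropped so reindented copies still match.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(toks: list[str], n: int) -> set[tuple[str, ...]]:
    # Every run of n consecutive tokens; empty if there are fewer than n tokens.
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def build_index(corpus: list[str], n: int) -> set[tuple[str, ...]]:
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(tokens(doc), n)
    return index


def has_verbatim_overlap(suggestion: str, index: set[tuple[str, ...]], n: int) -> bool:
    # True if any n-token run in the suggestion also appears in the corpus.
    return not ngrams(tokens(suggestion), n).isdisjoint(index)


# Toy corpus standing in for the training set (illustrative only).
corpus = [
    "def range_check(length, start, end):\n"
    "    if start > end:\n"
    "        raise ValueError('start > end')\n"
]
index = build_index(corpus, n=8)

copied = "if start > end:\n  raise ValueError('start > end')"
original = "return a + b"

print(has_verbatim_overlap(copied, index, 8))
print(has_verbatim_overlap(original, index, 8))
```

The window size `n` is the main knob here: too small and common idioms get flagged constantly, too large and short copied snippets slip through. A real system would also need a far more compact index than an in-memory set.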

Abimelex · on June 29, 2021

Not sure how this is handled in the US, but in Germany a few lines of code generally do not have enough uniqueness to be licensed.

hakre · on July 2, 2021

So the corpus has been compiled under license and the derivative work is eligible for distribution?

mempko · on June 29, 2021

Finally, a faster way to spread bugs than copy/paste.

richardanaya · on June 29, 2021

You're using the word "memorized" in a very loose way.

delaaxe · on June 29, 2021

His point still holds: GPT-3 can output large chunks of licensed code, verbatim.

tsbinz · on June 29, 2021

How is it loose? Both in the colloquial sense and in the sense it is used in machine learning it is fitting. https://bair.berkeley.edu/blog/2020/12/20/lmmem/ is a post demonstrating it.