
What is the licensing for code generated in this way? GPT-3 has memorized hundreds of texts verbatim and can be prompted to regurgitate that text. Has this model only been trained on code that doesn't require attribution as part of the license?


The landing page for it states the below, so hopefully not too much of an issue (though I guess some folks may find a 0.1% risk high).

> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.


If you pulled a marble out of a bag ten times a day, with a 0.1% chance each time that it was red: after a day you'd have a 1% chance of seeing a red marble, the first week you'd have a 6.8% chance, the first month you'd have a 26% chance, and the first working year you'd have a 92.6% chance of having seen at least one red marble.

Probabilities are fun!
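
For anyone who wants to check the marble arithmetic, here is a minimal sketch. The period lengths (7-day week, 30-day month, 260-day working year) are my assumptions, picked because they appear to reproduce the figures quoted; the function and constant names are made up for illustration. Draws are treated as independent, so the chance of at least one red marble in n draws is 1 - (1 - p)^n, and the results agree with the numbers in the comment to within rounding.

```python
def chance_of_at_least_one(p_per_draw: float, draws: int) -> float:
    """Probability of at least one hit in `draws` independent draws."""
    return 1.0 - (1.0 - p_per_draw) ** draws


P_RED = 0.001        # 0.1% chance per draw, from the quoted landing page
DRAWS_PER_DAY = 10   # ten draws a day, as in the comment

# Assumed period lengths in days (not stated in the comment).
PERIODS = {"day": 1, "week": 7, "month": 30, "working year": 260}

chances = {
    name: chance_of_at_least_one(P_RED, DRAWS_PER_DAY * days)
    for name, days in PERIODS.items()
}

for name, chance in chances.items():
    print(f"first {name}: {chance:.1%}")
```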


Well within the margin of fair use.


That's not how fair use works. It doesn't matter how unlikely it is: if Copilot one day decides to "suggest" a significant snippet of code from a GPLed app, you'd better be planning to GPL your project.


No, and why that is not the case has been explained before.

See e.g. https://lwn.net/Articles/61292/, though that is most likely only one opinion.

On the other hand, it would be interesting to learn what the copyright implications are of

a) creating a utility like Copilot (itself a software program) that contains a corpus based on copyrighted material (the trained database), and

b) using it to create code based on that corpus, resulting in software that is itself a work under copyright.


And you would have a whole lot of blue marbles.


"I only murdered them once" isn't the best of legal defenses.


Automatic refactoring could be useful for avoiding a lot of dumb legal disputes.

I say dumb because I am, perhaps chauvinistically, assuming that no brilliant algorithmic insights will be transferred via the AI copilot, only work that might have been conceptually hard for a beginner but feels routine to a day laborer.

Then again that assumption suggests there'd be nothing to sue over.


True, but that definitely wouldn't stop Oracle from suing over it anyway. (See the rangeCheck chapter of Oracle v Google [0])

Also, Oracle v Google opens the possibility of a fair-use defense in the event that Copilot does regurgitate a small snippet of code.

[0] https://news.ycombinator.com/item?id=11722514


I'd be surprised if a company's legal department would be OK with that 0.1% risk.


Google already learned that one. "There's only a tiny chance we may be copying some public code from Oracle" may not be a good explanation there.


Life wouldn’t be entertaining without the License Nazis. No code for you! (Seinfeld reference)


Did they have a license to use public source code as a data source for the data set, though?


God, I wish contracts were encoded semantically rather than as plain text. I just tried to look through GitHub's terms of service[1]. I'd search for "GitHub can <verb> with <adjective> code" if I could. Instead I'm giving up.

[1] https://docs.github.com/en/github/site-policy/github-terms-o...


A world in which all laws and contracts were required to be written in Lojban would be interesting.


That looks hard. More politically feasible might be a language I've unfortunately forgotten the name of, ordinary English but with some extra rules designed to eliminate ambiguity -- every noun has to carry a specifier like "some" or "all" or "the" or "a", etc.


Legalese might be similar to code, and there is lots of interest in making law machine readable. So don't give up; check back later.


Yes, it's public source code.


Public doesn’t mean it’s not encumbered by copyrights.


Pretty much everything is trained on copyrighted content: machine translation software, TWDNE, DALL-E, and all the GPTs. Software people are bringing this up now because it's their ox being gored. It's the same as when furries got upset about This Fursona Does Not Exist.[1][2]

1. https://news.ycombinator.com/item?id=23093911

2. https://www.reddit.com/r/HobbyDrama/comments/gfam2y/furries_...


To expand on your argument, pretty much every person is trained on copyrighted content too. Doesn't make their generated content automatically subject to copyright either.


Yeah, except that Oracle and Google have way more lawyer power than furry artists.


You have no idea how much money they make. Some of them have payment plans for commissions.


This is an argument for why this is a bigger problem, not a smaller one.


If it's BSD-licensed, the encumbrance doesn't matter much.


Update: Nat Friedman answered this as part of this thread on Twitter:

https://twitter.com/natfriedman/status/1409883713786241032

Basically they are building a system to find explicit copying and warn developers when the output is verbatim.
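
To make "find explicit copying" concrete, here is a rough sketch of one way such a check could work. This is purely my own illustration and not GitHub's actual system, which isn't described in the tweet: it flags a suggestion if any run of n consecutive tokens also appears in a known document. The function names, the tokenizer, and the window size n = 8 are all assumptions.

```python
import re


def token_ngrams(text: str, n: int) -> set:
    """All runs of n consecutive tokens (identifiers, numbers, punctuation)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(suggestion: str, corpus: list[str], n: int = 8) -> bool:
    """True if any n-token run of the suggestion occurs in a corpus document.

    Whitespace is ignored by the tokenizer, so re-indented copies still match.
    The window size n is an arbitrary illustrative threshold.
    """
    wanted = token_ngrams(suggestion, n)
    return any(wanted & token_ngrams(doc, n) for doc in corpus)
```

A real system would need an index rather than a linear scan over the corpus, and picking n is a judgment call: too small and every `for` loop gets flagged, too large and lightly edited copies slip through.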


Not sure how this is handled in the US, but in Germany a few lines of code are generally not unique enough to be licensed.


So the corpus has been compiled under license and the derivative work is eligible for distribution?


Finally, a faster way to spread bugs than copy/paste.


You're using the word "memorized" in a very loose way.


His point still holds: GPT-3 can output large chunks of licensed code verbatim.


How is it loose? Both in the colloquial sense and in the sense it is used in machine learning it is fitting. https://bair.berkeley.edu/blog/2020/12/20/lmmem/ is a post demonstrating it.



