Ask HN: License to prohibit training of systems like Microsoft Copilot
39 points by bugfix-66 on Nov 3, 2022 | 42 comments
We want to modify the BSD 2-Clause Open Source License to explicitly prohibit the use of the licensed software in the training of systems like Microsoft's Copilot. We also want to prohibit use during inference.

Question: How should the third clause be worded, below?

  The No-AI 3-Clause Open Source Software License

  Copyright (C) <YEAR> <COPYRIGHT HOLDER>

  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in
     the documentation and/or other materials provided with the
     distribution.

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The above is a verbatim (word-for-word) copy of the BSD 2-Clause Open Source License, except for its title and the addition of a third clause:

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

Thanks!


That license is neither free nor open source, so you'd be stooping to Microsoft's level by using it. And it wouldn't help anyway, since they're claiming fair use, which if true, means they don't need a license. (And if false, then they're in violation of the existing licenses anyway.)


> which if true, means they don't need a license. (And if false, then they're in violation of the existing licenses anyway.)

Precisely my thoughts; either they're already violating the attribution requirements of such a license, or it's under Fair Use, in which case the question is moot. There's no need to make them doubly non-compliant.

I'm not an attorney, but I have worked professionally with IP policies and licenses for a number of years. I generally give the following suggestion to developers considering FOSS licenses: once you've put your code under an open license, assume that you no longer own today; you only own tomorrow.

What I mean by that is that once the source code is out there under a given open source license, it's best to consider it a total loss until the next time you change the code. Of course there's always the right to challenge and litigate, but this is often time-consuming, expensive, and an uphill battle. I can't even put a number on how many times (it's a large number) I've dealt with teams who love the "warm fuzzy" of creating an open source project but get apoplectic when they realize that it means giving up quite a bit of control over anything published as FOSS.

Having said all of that, I think OP is doing exactly the right thing in reconsidering the license. Maybe an OSI-approved FOSS license is not right for this project and a more restrictive license is appropriate, but personally I think that in the face of this new type of use case, most FOSS licenses should clarify and refine attribution requirements. I'd love to see some further guidance from OSI on this.

Edit: for example, it might become recommended practice to put generic, boilerplate, or non-innovative code under permissive FOSS licenses from day one, but leave extremely novel, innovative, or unique code under a more restrictive license (such as one with more stringent attribution requirements) in a separate module for a time, until credit for the innovation is well-established (after which even the innovation can be published under a permissive license). Not a panacea; simply an early thought.


It's the BSD 2-clause license, a very old, very widely-used, and very respectable open source license:

https://en.m.wikipedia.org/wiki/BSD_licenses

We've added one additional clause EXPLICITLY PROHIBITING use in systems like Copilot.

There's no ambiguity. Training your language model is a direct and unequivocal violation of clause 3.

If this is not adequate to prevent Copilot-style intellectual property theft then nothing is.

How else can we protect ourselves from Copilot-style use without attribution?


Yes, the original license was open source. It's exactly your change that makes it not open source.


Please elaborate on this surprising claim, in detail.

Here is the OSI definition of Open Source:

https://en.m.wikipedia.org/wiki/The_Open_Source_Definition


Sure. Here's a quote from the Open Source Definition:

> No Discrimination Against Fields of Endeavor

> The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

And here's the license term in question:

> Use in source or binary forms for the construction or operation of predictive software generation systems is prohibited.


I see, it's clause 6 in the OSI definition:

  6. No discrimination against fields of endeavor: The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

So it's one or the other. Basically, Copilot makes all open source software (according to the OSI definition) necessarily use-without-attribution. Your code is my code. Your work is instantly my work, and no credit will be given.

I think this is a special case, illustrating that these new language models require us to reformulate our conception of what "open source" and "fair use" entail.

The laws and definitions need to adapt to a changing world.


> Basically, Copilot makes all open source software (according to the OSI definition) necessarily use-without-attribution.

No, Copilot is violating the existing open source licenses by not providing the attribution that they require. If Microsoft doesn't care about that, then they won't care about violating your license either.


I think that's a misunderstanding of what the Open Source Definition refers to. The way I read it, you must be allowed to run the binary to further any goal.

But it doesn't mean you can share the source code however you want. Most FOSS licenses add restrictions on that, which is why there are so many of them. To be fair, OP might need to reword the clause a little to clarify that you can use the binary to carry out training, but that you are not allowed to create derived works of the source or binary by training on them.


Hold on, how can it ever be fair use if this is literally at the foundation of their commercial offering?


I'm not saying that it is. I'm saying that whether or not it is, this license won't make a difference.


It's certainly less ambiguous.

The BSD license restricts both "redistribution" and "use":

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

Training a Copilot-like system is "use" in obvious violation of clause 3:

  3. Use in source or binary forms for the construction or operation
     of predictive software generation systems is prohibited.

What else can we do?


Just because you use something for commercial purposes doesn't mean it can't be fair use. See Google v Oracle.


Wasn't that about API as opposed to implementation?


There was a question regarding whether an API could be copyrighted, which the Supreme Court never answered. The last decision in that case on the topic was that it could be. A jury found that Google's use of the copyrighted API constituted fair use. The Federal Circuit then overturned that verdict. The Supreme Court didn't rule on the copyrightability of the API because it decided that, even if the API was copyrightable, Google's use of it was fair use.

The comment I replied to implied that something couldn't be fair use if the use was commercial. Google's use of the API was commercial and yet ruled fair use by the highest court in the land.


I hold that whatever is fair use for an API is not even close to what is fair use for the source of an actual implementation. Let's see what the court has to say on that.


From what I vaguely understand, part of the GitHub terms of service is granting GitHub / Microsoft various rights to store, index, etc. the contents of your repositories. That is, in addition to whatever license you choose to use for your project, you are granting GH/MS additional rights on top of that (if they are not already granted by the project license).

While I am not a lawyer / licensing expert, I would be very surprised if the terms of service did not include rights or some other loophole allowing it to be used for training machine learning systems[1]. So the actual license you'd use is probably irrelevant while it's hosted on GitHub.

[1] A paragraph from section D4 here https://docs.github.com/en/site-policy/github-terms/github-t... states:

"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."

I'd guess Copilot would be covered by either the "parse" provision or the "improving the Service" provision.


Perhaps it could be argued that Copilot was an unexpected development, and the users of GitHub could not possibly have intended to donate all their work to be used anonymously and without attribution?

It's a bit like how non-compete clauses are unenforceable in California. The Copilot theft of intellectual property might be viewed as being too far from anyone's reasonable expectations?

We'll see how the legal battle turns out.


> part of the GitHub terms of service

There is a ton of code on GitHub by authors who do not use GitHub and are not bound by its terms of service. Microsoft and GitHub have no permission to use that code beyond what is granted in its license.

> You grant us and our legal successors the right to store, archive, parse, and display Your Content

Many GitHub uploaders do not have the right to make any such grants to anyone, since they are uploading someone else's code.

If someone else has uploaded your code to GitHub, even if the code is entirely open source, you have the right to demand, for instance, that GitHub not serve any portion of it without showing your copyright notice. A search function which shows only the several lines of your code that match some search terms could be deemed infringing. A license like two-clause BSD doesn't allow an excerpt of your code to be transmitted over a network stripped of the copyright notice, list of conditions, and disclaimer.


That doesn't absolve the users of the generated output (they have to clean-room it to be safe); it just means Microsoft can't be sued for using the code you voluntarily uploaded.


These AI systems are all riding on fair use anyway (at least for now), so it doesn't matter what license you choose because they're never accepting it in the first place.


Law-abiding companies like Microsoft won't knowingly violate a license.

My understanding is that Microsoft makes an effort to exclude all GPL software from its training. (Is this true?)

What we need is a more permissive MIT- or BSD-style license that defeats Copilot.

We just need to convince law-abiding companies that they must not use the code.


Where did Microsoft say that GPL code was excluded from the training dataset? Since not including attribution is already a violation of MIT/BSD-style licenses, I don't think MS is deterred by the prospect of violating licenses.


Unfortunately, I cannot find support for my claim that Microsoft excludes GPL-licensed code.


The problem with making up your own license clause is that responsible companies now need to pay their lawyers to review it if they want to use your software. This is going to exclude a lot of users who don't want to pay for that review, rather than just the ones who'd use it for purposes you want to prohibit.


If someone is against Copilot-style use of their code, would they want commercial organizations using it? It seems like such a hurdle would be quite attractive for many projects.


Well, what are our options?

What else can we do to protect ourselves from Microsoft Copilot?


Do we have to protect ourselves?

I understand that some may disagree that training AI with copyrighted data is fair use, but protecting?

If you are scared that many companies will steal open source code and not give anything back, well, that’s already the case. If you are scared of being replaced by an AI, forbidding the AI doesn’t seem to be the right approach. It’s too late, and you should perhaps start to use it and improve it instead of fighting it.


What you're trying to do is pointless and wrongheaded; putting this license on your software will cause distributions to treat it as radioactive.

The conditions of the two-clause BSD license already prohibit the uses that you're trying to ban, because those uses do not "retain the above copyright notice, this list of conditions and the following disclaimer".

Suppose that your software were mangled by AI, in a way that conforms to the two-clause BSD license: the copyright notice and all are there. What would be wrong with it?

Until you explain what the actual problem is, it's impossible to advise you on licensing.

For instance, are you concerned that even if some AI preserves your copyright notices, your name is then being associated with gibberish?

People could cause that problem, too; some human could take your two-clause-BSD-licensed program, and make garbage modifications to it which make it look like unprofessional crap, without indicating that the program has been modified. The result carries only your copyright notice, making it look like you wrote it that way.

If that possibility makes you uncomfortable, find an existing license which requires modified works to be clearly indicated as modified.


Why not hire a licensed lawyer in your jurisdiction, ideally one with expertise in copyright law, and ask them? If there are licensed lawyers here, they know better than to try to give you legal advice (a) for free, (b) anonymously, and (c) without knowing where you are, and anyone else is unqualified to answer your question.

You can't just wing this stuff, dude.


We're consulting the lawyers here:

https://www.saverilawfirm.com/our-cases/github-copilot-intel...

I'll let you know what they say.


agpl3

they will have to open source all code and data that produced the derivative work.

why reinvent the wheel?


We need a more permissive MIT- or BSD-style license.

The GPL is good for some purposes, but too restrictive for others.


What is the use-case for a permissive license that does not permit use by code-fragment-compiler AI systems?


I use MIT- and BSD-licensed code at work in a closed-source commercial system. The documentation for our system acknowledges our use of the open-source code, and I'm sure the authors are proud to be part of our codebase (with attribution). That's the use-case.


Other than the ZFS-on-Linux situation (which is only a problem because the CDDL was intentionally written to cause it), what is the GPL "too restrictive" for? In the ideal world, literally all software would be AGPLv3.


In the ideal world, literally all software would be CC-0.

Copilot is an example of something which should exist, in this form or optimally a much better one.

Copyright is an example of something which should not exist, in any form. As with patents, it's fundamentally a drag on innovation and development.


> In the ideal world, literally all software would be CC-0.

No, because then companies would release heavily obfuscated binaries of everything and not provide the source, or worse, require you to run everything in their cloud and never release anything to you.


What problem is this meant to address? How is this making OSS better?


Probably to stop corporations from profiting from this work. Or maybe these trillion-dollar companies could pay the people who made these repos if they want to train their models on them without express permission. Even better, maybe they could pay out residuals every time the model uses their code.






