
I think this is a bit too broad. There are actually three possible cases.

When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.

If the code is completely different, then whether or not it was a clean-room effort is indeed irrelevant. The only way the author can claim you violated their copyright despite the lack of apparent similarity is to prove that you followed some kind of mechanical process for generating the new code from the old one, such as using an LLM with the old code in the input prompt (TBD, completely unsettled: what if the old code was part of the training set, but not part of the input?). The burden of proof is on them to show that the dissimilarity is only apparent.

In realistic cases, you will have a mix of similar and dissimilar portions, plus portions where the similarity is questionable. Each of these will need to be analyzed separately, and it's very likely that all the similar portions will need to be re-written if you can't prove that they were not copied, directly or from memory, from the original, even if they represent a very small part of the work overall. Even if you wrote a 10,000-page book, if you copied one whole page verbatim from another book, you would be liable for that page, and the author could force you to take it out.



> When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.

Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:

- fair use

- claiming the copied part doesn't require creativity

- arguing that the copied code was written by AI (there’s jurisdiction that says AI-generated art can’t be copyrighted (https://www.theverge.com/2023/8/19/23838458/ai-generated-art...). It’s not impossible judges will make similar judgments for AI-generated programs)


Courts have ruled that you can't assign copyright to a machine, because copyright protection requires human authorship. ** There is not currently a legal consensus on whether or not the humans using AI tools are creating derivative works when they use AI models to create things.

** This case is similar to an older one where a ~~photographer~~ PETA claimed a monkey owned the copyright to a photo, because they said the monkey took the photo entirely on its own. The court said "okay well, it's public domain then, because only humans can have copyrights".

Imagine you put a Harry Potter book in a copy machine. It is correct that the copy machine would not hold a copyright to the output. But you would still be violating copyright by distributing the output.


https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput... Specifically, the photographer claimed he owned the copyright on a photo he didn't directly take. PETA weighed in, trying to argue that the monkey owned the copyright.


Ah yeah you’re right I forgot it was PETA arguing that.


> there’s precedent that AI-generated art can’t be copyrighted

The headline was misleading. The court said that what Thaler could have copyrighted was a complicated question it declined to address, because he had stated he was not the author.


- Arguing that you owned the copyright on the copied code (the author here has apparently been the sole maintainer of this library since 2013; not all, but a lot of the code that could have been copied here probably already belongs to him...)

The burden of proof is completely uncharted when it comes to LLMs. Burden of proof is assigned by court precedent, not the Copyright Act itself (in US law). Meaning, a court looking at a case like this could (should) see the use of an LLM trained on the copyrighted work as a distinguishing factor that shifts the burden to the defense. As a matter of public policy, it's not great if infringers can use the poor accountability properties of LLMs to hide from the consequences of illegally redistributing copyrighted works.

The way I see it, it looks like this:

1. Initially, when you claim that someone has violated your copyright, the burden is on you to make a convincing claim on why the work represents a copy or derivative of your work.

2. If the work doesn't obviously resemble your original, which is the case here, then the burden is still on you to prove that either

(a), it is actually very similar in some fundamental way that makes it a derived work, such as being a translation or a summary of your work

or (b), it was produced following some kind of mechanical process and is not a result of the original human creativity of its authors

Now, in regards to item 2b, there are two possible uses of LLMs that are fundamentally different.

One is actually very clear cut: if I give an LLM a prompt consisting of the original work + a request to create a new work, then the new work is quite clearly a derived work of the original, just as much as a zip file of a work is a derived work.

The other is very much not yet settled: if I give an LLM a prompt asking for it to produce a piece of code that achieves the same goal as the original work, and the LLM had in its training set the original work, is the output of the LLM a derived work of the original (and possibly of other parts of the training set)? Of course, we'll only consider the case where the output doesn't resemble the original in any obvious way (i.e. the LLM is not producing a verbatim copy from memory). This question is novel, and I believe it is being currently tested in court for some cases, such as the NYT's case against OpenAI.


On the other hand, as a matter of public policy, nobody should be able to claim copyright protection over the process of detecting whether a string is well-formed Unicode using code that in no material way resembles the original. This is not rocket science.
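To make that concrete: a well-formedness check is constrained almost entirely by the Unicode specification itself (the valid lead-byte and continuation-byte ranges in Table 3-7 of the standard), which is why independent implementations tend to converge on the same structure. A minimal sketch in Python — the function name and layout are my own illustration, not code from the library under discussion:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte sequence is well-formed UTF-8.

    The byte ranges below come straight from the Unicode standard's
    definition of well-formed sequences, so any correct implementation,
    however independently written, must encode the same ranges.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                        # 1-byte sequence (ASCII)
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:               # 2-byte sequence
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                     # 3-byte, excludes overlongs
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:                     # excludes UTF-16 surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif b == 0xF0:                     # 4-byte, excludes overlongs
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                     # caps code points at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return False                    # invalid lead byte
        if i + need >= n:
            return False                    # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False                    # bad first continuation byte
        for j in range(2, need + 1):
            if not 0x80 <= data[i + j] <= 0xBF:
                return False                # bad later continuation byte
        i += need + 1
    return True
```

The point is that the constants and the loop shape are dictated by the spec, not by any one author's creative choices; two honest implementations will look alike whether or not either author ever saw the other's code.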


