Hacker News

It's probably worth considering how the thing actually works.

LLMs are sort of like a fancy compression dictionary, except that we use them in reverse: instead of compressing likely text into smaller bitstrings, they generate likely text. But you could also use them to compress text, because for most ordinary text with a common probability distribution, there is very likely a much shorter prompt + seed that would generate the same text.
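You can see the model/compression duality with a toy sketch. This replaces the neural model with simple character frequencies (a stand-in, not how LLMs actually work); the point is only the information-theoretic one: the better a model predicts the text, the fewer bits an ideal code needs for it.

```python
import math
from collections import Counter

def ideal_code_bits(text, model):
    """Shannon code length: a symbol with probability p costs -log2(p) bits."""
    return sum(-math.log2(model[ch]) for ch in text)

text = "the cat sat on the mat"

# Baseline model: all 27 symbols (a-z plus space) equally likely.
uniform = {ch: 1 / 27 for ch in "abcdefghijklmnopqrstuvwxyz "}

# "Trained" model: character frequencies estimated from the text itself,
# standing in for a predictor that has learned the distribution.
counts = Counter(text)
trained = {ch: counts[ch] / len(text) for ch in counts}

print(ideal_code_bits(text, uniform))   # about 104.6 bits
print(ideal_code_bits(text, trained))   # noticeably fewer bits
```

A real LLM is a far sharper predictor than character frequencies, which is why a short prompt + seed can stand in for a long stretch of predictable text.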

Which is basically what the lawyers are doing. Keep trying combinations until it generates the text you want.

But the ability to do that isn't really that surprising. If you feed a copyrighted article to gzip, it will give you a much shorter string that you can then feed back to gunzip to get back the article. That doesn't mean gunzip has some flaw or ill intent. It also doesn't imply that the article is even stored inside of the compression library, rather than there just being a shorter string that can be used to represent it because it contains predictable patterns.
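The gzip point is easy to check directly with Python's zlib (same DEFLATE algorithm), using a made-up, deliberately repetitive "article":

```python
import zlib

# Hypothetical repetitive "article" standing in for text with predictable patterns.
article = ("Lawmakers met again today to debate the measure, "
           "and again today the measure failed to advance. ") * 5

compressed = zlib.compress(article.encode(), level=9)
restored = zlib.decompress(compressed).decode()

assert restored == article  # lossless round trip
# The compressed string is much shorter, yet the article isn't "stored"
# inside zlib; the text's redundancy is what permits the short representation.
print(len(article.encode()), len(compressed))
```

The shorter string reproduces the article exactly, but nobody would say the article lives inside the zlib library.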

It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.



> It's not implausible that an LLM could generate a verbatim article it was never even trained on if you pushed on it hard enough, especially if it was trained on writing in a similar style and other coverage of the same event.

That'd be a coincidence, not a verbatim copy. Copyright law doesn't prohibit independent creation. This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place. There is no plausible way OpenAI could say they never had access to the articles they are producing verbatim copies of.

Rather than sneeringly explain away how LLMs work without any eye towards the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessant "no let me explain how they work, it's fine I swear!" shtick.


> That'd be a coincidence, not a verbatim copy.

It would be both. Or to put it a different way, how would you distinguish one from the other?

> This defense isn't available to OpenAI because there is no dispute OpenAI ingested the NYTimes articles in the first place.

The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

And in cases where you don't know, emitting some text is not conclusive evidence that it was in the training data. Most of the text emitted by LLMs isn't verbatim from the training data.

> Rather than sneeringly explain away how LLMs work without any eye towards the laws at issue, maybe you should do yourself the favor of learning about them so you can spare us this incessant "no let me explain how they work, it's fine I swear!" shtick.

This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.


>It would be both. Or to put it a different way, how would you distinguish one from the other?

No, it's not both. Have you made any effort to understand the law here? Copyright doesn't prohibit independent creation; I'm not sure how much simpler I can make that for you. In one scenario there is copying, in the other there isn't. The facts make it clear: when something is copied, it's illegal.

>The question remains whether ingesting the article is the reason it gets output in response to a given prompt, when it could have happened either way.

This can't be serious; it isn't credible. You're saying there's no difference between ingesting the article and then outputting it, versus never ingesting it and outputting it anyway. Do you have anything at all to back that up?

>This is a case of first impression. We don't really know what they're going to do yet. But "there exists some input that causes it to output the article" isn't any kind of offensive novelty; lots of boring existing stuff does that when the input itself is based on the article.

"First impression" (something you claim) doesn't mean ignore existing copyright law. One side is arguing this isn't first impression at all, it's just rote copying.

> But "there exists some input that causes it to output the article" isn't any kind of offensive novelty

You said it's novel; I called it plain copying.

>lots of boring existing stuff does that when the input itself is based on the article.

You are saying it's first impression... not me.



