
This is why it would be such a mistake to kneecap this process over copyright. The models need ALL the data.


Okay so we're all agreed that IP laws don't matter and we can have all of OpenAI's data for free? That's a good deal!


What do you mean "have"?


Is this a trick question? OpenAI blatantly used copyrighted works for commercial purposes without paying the IP owners, it would only be fair to have them publish the resulting code/weights/whatever without expecting compensation. (I don't want to publish it myself, of course, just transform it and sell the result as a service!)

I know this won't happen, of course, I am moreso hoping for laws to be updated to avoid similar kerfuffles in the future, as well as massive fines to act as a deterrent, but I don't dare to hope too much.


I was envisioning a future where we've done away with the notion of data ownership. In such a world the idea that we would:

> have all of OpenAI's data for free

Doesn't really fit. Perhaps OpenAI might successfully prevent us from accessing it, but it wouldn't be "theirs" and we couldn't "have" it.

I'm not sure what kind of conversations we will be having instead, but I expect they'll be more productive than worrying about ownership of something you can't touch.


So in that world you envision, someone could hack into OpenAI, then publish the weights and code. The hacker could be prosecuted for breaking into their system, but everyone else could now use the weights and code legally.

Is that understanding correct?


I think that would depend on whether OpenAI was justified in retaining and restricting access to that data in the first place. If they weren't, then maybe they get fined and the hacker gets a part of that fine (to encourage whistleblowers). I'm not interested in a system where there are no laws about data, I just think that modeling them after property law is a mistake.

I haven't exactly drafted this alternative set of laws, but I expect it would look something like this:

If the data is derived from sources that were made available to the public with the consent of its referents (and subject to whatever other regulation), then walling it off would be illegal. On the other hand, data regarding users' behavior would be illegal to share without the users' consent and might even be illegal to retain without their consent.

If you want to profit from something derived from public data while keeping it private, perhaps that's ok but you have to register its existence and pay taxes on it as a data asset, much like we pay taxes on land. That way we can wield the tax code to encourage companies that operate in the clear. This category would probably resemble patent law quite a bit, except ownership doesn't come by default, you have to buy your property rights from the public (since by owning that thing, you're depriving the masses of access to it, and since the notion that it is a peg that fits in a property shaped hole is a fiction that requires some work on our part to maintain).


This is alleged, and it is very likely that claimants like the New York Times accidentally prompt-injected their own material to show the violation (not understanding how LLMs really work), clouded by the hope of a big payday rather than actual justice/fairness etc...

Anyways, the laws are mature enough for everyone to work this out in court. Maybe it comes out that they have a legitimate concern, but the way they presented their evidence so far in public has seriously been lacking.


Prompt injecting their own article would indeed be an incredible show of incompetence by the New York Times. I'm confident that they're not so dumb that they put their article in their prompt and were astonished when the reply could reproduce the prompt.

Rather, the actual culprit is almost certainly overfitting. The articles in question were pasted many times on different websites, showing up in the training data repeatedly. Enough of this leads to memorization.


They hired a third party to make the case, and we know nothing about that party except that they were lawyers. It is entirely possible, since this happened very early in the LLM game, that they didn’t realize how the tech worked, and fed it enough of their own article for the model to piece it back together. OpenAI talks about the challenge of overfitting, and how they work to avoid it.


The goal is to end up with a model capable of discovering all the knowledge on its own, not relying on what humans produced before. Human knowledge contains errors; I want the model to point out those errors and fix them. The current state is a crutch at best to get over the current low capability of the models.


Then lawmakers should change the law, instead of a private actor asserting that their need overrides others' rights.


"Congressman, I have Mr. Altman on line 2."


Or rather, I have an unending stream of callers with similar-sounding voices who all want to make chirpy persuasive arguments in favor of Mr Altman's interests.


With you 100% on that, except that after you defeat the copyright cartel, you'll have to face the final boss: OpenAI itself.

Either everybody should get the benefits of this technology, or no one should.


If OpenAI actually followed their initial mission and didn't become ClosedAI I think people would be much more on their side.


This is an anti-human ideology as bad as the worst of communism.

Humanity only survives as much as it preserves human dignity, let's say. We've designed society to give rewards to people who produce things of value.

These companies take that value and give nothing back to the creators.

Supporting this will lead to disaster for all but the few, and ultimately for the few themselves.

Paying for your (copyrighted) inputs is harmony.


These models literally need ALL data. The amount of work it would take just to account for all the copyrights, let alone negotiate and compensate the creators, would be infeasible.

I think it’s likely that the justice system will deem model training as fair use, provided that the models are not designed to exactly reproduce the training data as output.

I think you hit on an important point though: these models are a giant transfer of wealth from creators to consumers / users. Now anyone can acquire artist-grade art for any purpose, basically for free — that’s a huge boon for the consumer / user.

People all around the world are going to be enriched by these models. Anyone in the world will be able to have access to a tutor in their language who can teach them anything. Again, that is only possible because the models eat ALL the data.

Another important point: original artwork has been made almost completely obsolete by this technology. The deed is done, because even if you push it out 70 years, eventually all of the artwork that these models have been trained on will be public domain. So, 70 years from now (or whatever it is) the cat will be out of the bag AND free of copyright obligations, so 2-3 generations from now it will be impossible to make a living selling artwork. It’s done.

When something becomes obsolete, it’s a dead man walking. It will not survive, even if it may take a while for people to catch up. Like when the vacuum tube computer was invented, that was it for relay computers. Done. And when the transistor was invented, that was it for vacuum tube computers.

It’s just a matter of time before all of today’s data is public domain and the models just do what they do.

…but people still build relay computers for fun:

https://youtu.be/JZyFSrNyhy8?si=8MRNznoNqmAChAqr

So people will still produce artwork.


> The amount of work it would take just to account for all the copyrights, let alone negotiate and compensate the creators, would be infeasible.

Your argument is the same as Facebook saying “we can’t provide this service without invading your privacy” or another company saying “we can’t make this product without using cancerous materials”.

Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money and is unwilling to come up with solutions to problems caused by your business model.


This is bigger than the greed of any group of people. This is a technological sea change that is going to displace and obsolesce certain kinds of work no matter where the money goes. Even if open models win where no single entity or group makes a large pile of money, STILL the follow-on effects from wide access to models trained on all public data will unfold.

People who try to prevent models from training on all available data will simply lose to people who don’t, and eventually the maximally-trained models will proliferate. There’s no stopping it.

Assume a world where models proliferate that are trained on all publicly-accessible data. Whatever those models can do for free, humans will have a hard time charging money for.

That’s the sea change. Whoever happens to make money through that sea change is a sub-plot of the sea change, not the cause of it.

If you want to make money in this new environment, you basically have to produce or do things that models cannot. That’s the sink or swim line.

If most people start drowning then governments will be forced to tax whoever isn’t drowning and implement UBI.


Maybe the machines will just pay for more of our leisure time, as they were originally designed to do? It may just be as simple as that?

Remember the 4-hour work week? Maybe we are almost there?

Let’s face it, most people in a developed country have more free time than they know what to do with, mostly spent on HN and social media ofc :)


Check out the short story Manna by Marshall Brain for some speculative fiction on exactly these subjects.

https://marshallbrain.com/manna1


>Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money

It used to be that property rights extended all the way to the sky. This understanding was updated with the advent of the airplane. Would a world where airlines need to negotiate with every land-owner their planes fly above be better than ours? Would commercial flight even be possible in such a world? Also, who is greediest in this scenario, the airline hoping to make a profit, or the land-owners hoping to make a profit?


Your comment seems unfair to me. We can say the exact same thing for the artist / IP creator:

Tough luck, then. You don’t have the right to shit on and harm everyone else just because you’re a greedy asshole who wants all the money and is unwilling to come up with solutions to problems caused by your business model.

Once the IP is on the internet, you can't complain about a human or a machine learning from it. You made your IP available on the internet. Now, you can't stop humanity benefiting from it.


Talk about victim blaming. That’s not how intellectual property or copyright work. You’re conveniently ignoring all the paywalled and pirated content OpenAI trained on.

https://www.legaldive.com/news/Chabon-OpenAI-class-action-co...

Those authors didn’t “make their IP available on the internet”, did they?


First, “Plaintiffs ACCUSE the generative AI company.” Let’s not assume OpenAI is guilty just yet. Second, assuming OpenAI didn’t access the books illegally, my point still remains. If you write a book, can you really complain about a human (or in my humble opinion, a machine) learning from it?


> So people will still produce artwork.

There's zero doubt that people will still create art. Almost no one will be paid to do it though (relative to our current situation where there are already far more unpaid artists than paid ones). We'll lose an immeasurable amount of amazing new art that "would have been" as a result, and in its place we'll get increasingly bland/derivative AI generated content.

Much of the art humans will create entirely for free in whatever spare time they can manage after their regular "for pay" work will be training data for future AI, but it will be extremely hard for humans to find as it will be drowned out by the endless stream of AI generated art that will also be the bulk of what AI finds and learns from.


AI will just be another tool that artists will use.

However the issue is that it will be much harder to make a career in the digital world from an artistic gift and personal style: one's style will not be unique for long as AI will quickly copy it and so make the original much less valuable.


AI will certainly be a tool that artists use, but non-artists will use it too so very few will ever have the need to pay an artist for their work. The only work artists are likely to get will be cleaning up AI output, and I doubt they'll find that to be very fulfilling or that it pays them well enough to make a living.

When it's harder to make a career in the digital world (where most of the art is), it's more likely that many artists will never get the opportunity to fully develop their artistic gifts and personal style at all.

If artists are lucky then maybe in a few generations with fewer new creative works being created, AI almost entirely training on AI generated art will mean that the output will only get more generic and simplistic over time. Perhaps some people will eventually pay humans again for art that's better quality and different.


The prevalence of these lines of thought makes me wonder if we'd see a similar backlash against Star Trek-style food replicators. "Free food machines are being used by greedy corporations to put artisanal chefs out of business. We must outlaw the free food machines."


>one's style will not be unique for long as AI will quickly copy it and so make the original much less valuable

Note that the fashion industry doesn't have copyrights, and runway fashions get copied very quickly. Fashion designers still exist in such a world.


There are alternative systems. One would be artists making a living through other ways, such as live performances, meet and greets, book signings, etc.

We could also do patronage. That's how musicians used to be funded. Even today we have grants from public/private institutions.

We could also drift back into "owning the physical media". We see this somewhat with the resurgence of records.

NFTs would have been another way, but at least initially, they failed to become generally accepted in the popular consciousness.


I'll gladly put money on music that a human has poured blood, sweat, tears and emotion into. Streaming has already killed profits from album sales, so live gigs are where the money is at, and I don't see how AI could replace that.


Lol, you really want content creators to aid AI in replacing them without any compensation? Would you also willingly train devs to do your job after you've been laid off, for free?

What nonsense. Just because doing the right thing is hard, or inconvenient doesn't mean you get to just ignore it. The only way I'd be ok with this is if literally the entire human population were equal shareholders. I suspect you wouldn't be ok with that little bit of communism.


There is no way on Earth that people playing by the existing rules of copyright law will be able to compete going forward.

You can bluster and scream and shout "Nonsense" all you want, but that's how it's going to be. Copyright is finished. When good models are illegal or unaffordable, only outlaws -- meaning hostile state-level actors with no allegiance to copyright law -- will have good models.

We might as well start thinking about how the new order is going to unfold, and how it can be shaped to improve all of our lives in the long run.


I think there’s no stopping this train. Whoever doesn’t train on all available data will simply not produce the models that people actually use, because there will be people out there who do train models on all available data. And as I said in another comment, after some number of decades all of the content that has been used to train current models will be in the public domain anyway. So it will only be a few generations before this whole discussion is moot and the models are out there that can do everything today’s models can, unencumbered by any copyright issues. Digital content creation has been made mostly obsolete by generative AI, except for where consumers actively seek out human-made content because that’s their taste, or if there’s something humans can produce that models cannot. It’s just a matter of time before this all unfolds. So yes, anyone publishing digital media on the internet is contributing to the eventual collapse of people earning money to produce content that models can produce. It’s done. Even if copyright delays it by some decades, eventually all of today’s media will be public domain and THEN it will be done. There are 0 odds of any other outcome.

To your last point, I think the best case scenario is open source/weight models win so nobody owns them.


> We've designed society to give rewards to people who produce things of value

Is that really what copyright does though? I would be all for some arrangement to reward valuable contributions, but the way copyright goes about allocating that reward is by removing the right of everyone but the copyright holder to use information or share a cultural artifact. Making it illegal to, say, incorporate a bar you found inspiring into a song you make and share, or to tell and distribute stories about some characters that you connected with, is profoundly anti-human.


I'm shocked at how otherwise normally "progressive" folks or even so-called "communists" will start to bend over for IP laws the moment that they start to realize the implications of AI systems. Glad to know that accusations of the "gnulag" were unfounded I guess!

I now don't believe most "creative" types when they try to spout radical egalitarian ideologies. They don't mean it at all, and even my own family, who religiously watched radical techno-optimist shows like Star Trek, are now falling into the depths of Luddism and running into the arms of copyright trolls to defend them.


If you're egalitarian, it makes sense to protest when copyright is abolished only for the rich corporations but not for actual people, don't you think? Part of the injustice here is that you can't get access to windows source code, or you can't use Disney characters, or copy most copyrighted material... But OpenAI and github and whatnot can just siphon all data with impunity. Double standard.


Copyright has been abolished for the little guy. I’m talking about AI safety doomers who think huggingface and Civit.AI are somehow not the ultimate good guys in the AI world.


This is a foul mischaracterization of several different viewpoints. Being opposed to a century-long copyright period for Mickey Mouse does not invalidate support for the concept of IP in general, and for the legal system continuing to respect the licensing terms of very lenient licenses such as CC-BY-SA.


The thinking is: ‘Anything that the little guy does to get ahead is justified; but if the rich do the same thing, that’s unfair.’



