>> This is not to endorse the European position - which is foolish, regressive and short sighted
Potentially, but I’m glad somebody is taking a minute to question whether or not a handful of tech companies have the right to consume everyone’s information, build their entire business with it and then sell that back to the people whose data they wouldn’t have existed without in the first place.
If you limit scraping and fair use, it further entrenches the power of the tech and content giants.
If you're upset about big tech's power and influence, you should be cheering on the free use of publicly available information, as it has the best shot at unseating that power.
I don't think anyone is arguing that copyright laws shouldn't apply to ChatGPT. In practice, I believe it's quite rare for ChatGPT to output copyrighted works, but if it does, the copyright holder would indeed have a claim against OpenAI for redistributing their work without permission.
That being said, this is not what the current ban is about. The ban was ordered by GDPR regulators who have suddenly decided that using publicly available information on the web to train machine learning models is not allowed due to "privacy concerns". It has nothing to do with copyright.
By this reasoning, if you're going around minding your own business with your face uncovered, I can just take a picture of you and use your face to train my models? I mean, it's out there, it's not behind a paywall, so it's not private.
We can’t just live in a world where something is by default up for grabs by big companies just because it’s out there.
In the US, anything you create is automatically copyrighted and you have full rights to decide who does what with it, unless you explicitly waive those rights, even if you post it publicly on the Internet.
Too many people seem to think that just because something’s public it means they’re allowed to do whatever they want with it. That’s incorrect. The only barrier is whether someone you copy from is willing to bring legal action.
There are a number of exceptions to copyright, and at least in the US, the Supreme Court's understanding of "transformativeness" applies to ChatGPT and other similar tech.
The relevant quote, originally written in 1990 and cited by the Supreme Court in 1994, is thus:

> [If] the secondary use adds value to the original--if the quoted matter is used as raw material, transformed in the creation of new information, new aesthetics, new insights and understandings--this is the very type of activity that the fair use doctrine intends to protect for the enrichment of society.
We've wandered into uncharted legal territory with only a few light posts guiding the way. Nothing about this is settled or obvious.
I'm not sure it's "unseating". Moving from one lobe of enclosed corporate power to another doesn't help. I think this serves more to enclose that which was free, than to liberate what was closed.
>I think this serves more to enclose that which was free, than to liberate what was closed.
What has been closed? Every day there are new foundational models being released. It's exhausting trying to keep up with the pace of change - Alpaca, Vicuna, LLaMA, etc. I'd be shocked if there weren't a truly open-source foundational model with performance equivalent to OpenAI's GPT-4 by the end of the year.
Any move to limit fair use and the scraping of publicly available information through copyright law, no matter how good the reason, gives more power to the biggest companies.
It doesn’t have to be all or nothing. You can regulate the big companies and leave the smaller companies unregulated. I’m not sure what regulations, if any, are necessary here but pausing to think about it when the consequences are potentially world changing seems like a good idea.
If anything, it's the opposite. It's the legal restrictions around scraping that have entrenched the power of today's walled gardens.
They built their businesses on scraping; then, once they had their lock-in and monopolies, they turned around and fought against scraping to keep upstarts from eroding their business.
This blog is a great overview of the current landscape around scraping laws.
Fair use has been a thorn in the side of content and publishing giants for a long time as it allows for upstarts to create derivative works and databases without violating the original copyright.
If you block fair use and scraping, those that hold the most data and copyrights (the largest companies) will be the only ones able to create quality foundational models, thus holding the keys and further entrenching their power.
Yes, I agree that the inquiry itself is not a bad thing. However, what I think is a bad thing is if the inquiry ends in a ban (which I think it will) and denies the rest of humanity the obvious value of this new class of technology.
Except it was considered by most as a win-win situation. Google consumed it for the purpose of directing people to the original content (except for Google’s recent foray into info cards and quick answers within search). The info in the LLM doesn’t reference back to the source. There may be 1000 sources for a given concept embedded in the model anyhow.
Yes, and arguably Google has been way under-regulated for the externalities it caused to the system (one example is pushing companies into adversarial bidding wars over their own keywords, which is simply a rentier tax on Google advertising).
I think we're finally coming around on that (I think a lot of regulators were slow on antitrust in the internet space because the harm to the consumer/environment is not immediately obvious - i.e. it's mostly non-monetary externalities)
It's the same with America as a whole and offshoring to reduce labor costs. The externalities are slow, but they're clearly there in the increasing polarization and the growth of the precariat.
It’s not. Indexing and linking to primary sources is very different from consuming all of those primary sources and spitting them back without attribution. Google entered this area a bit with the info boxes it displays alongside results, but that’s a very different proposition from AI.
This is an interesting point: if someone develops a card catalogue for a library, they aren't providing a market substitute for each one of those books. As far as I know, fair use law has to account for whether the end product can serve as a substitute for the original work.
Fair use copyright law in the US allows for selling summaries of books, and those can serve as a substitute for the original work in many cases. Many students use Cliff Notes to get through their literature courses without ever doing the assigned reading.
I find any comparison between a human reading something and a model owned by a private company ingesting all the world’s information absolutely ridiculous.
This is not the first time. Google built a trillion-dollar business by taking content people created across the internet and monetizing it without the creators' prior consent. It was the largest information heist in the history of mankind.
They also tried to do the same with the book industry; fortunately, publishers had the money to pay for lawsuits against Google.
It's a more advanced Google search. They may not like the fact that it was built on publicly available data, but Pandora's box has been opened for the world, and European companies will become less competitive without this valuable tool in their toolboxes.
This logic applies to nearly everything. People invent things and then sell them, but the invention would not have been possible if they had not relied on information they often got for free.