>> This is not to endorse the European position - which is foolish, regressive and short sighted
Potentially, but I’m glad somebody is taking a minute to question whether or not a handful of tech companies have the right to consume everyone’s information, build their entire business with it and then sell that back to the people whose data they wouldn’t have existed without in the first place.
If you limit scraping and fair use, it further entrenches the power of the tech and content giants.
If you're upset about big tech's power and influence, you should be cheering on the free use of publicly available information, as it has the best shot at unseating that power.
I don't think anyone is arguing that copyright laws shouldn't apply to ChatGPT. In practice, I believe it's quite rare for ChatGPT to output copyrighted works, but if it does, the copyright holder would indeed have a claim against OpenAI for redistributing their work without permission.
That being said, this is not what the current ban is about. The ban was ordered by GDPR regulators who have suddenly decided that using publicly available information on the web to train machine learning models is not allowed due to "privacy concerns". It has nothing to do with copyright.
By this reasoning, if you're going around minding your own business with your face uncovered, I can just take a picture of you and use your face to train my models? I mean, it's out there, it's not behind a paywall, so it's not private.
We can’t just live in a world where something is by default up for grabs by big companies just because it’s out there.
In the US, anything you create is automatically copyrighted and you have full rights to decide who does what with it, unless you explicitly waive those rights, even if you post it publicly on the Internet.
Too many people seem to think that just because something’s public it means they’re allowed to do whatever they want with it. That’s incorrect. The only barrier is whether someone you copy from is willing to bring legal action.
There are a number of exceptions to copyright, and at least in the US, the Supreme Court's understanding of "transformativeness" applies to ChatGPT and other similar tech.
The relevant quote, originally written in 1990 and cited by the Supreme Court in 1994, is thus:

> [If] the secondary use adds value to the original--if the quoted matter is used as raw material, transformed in the creation of new information, new aesthetics, new insights and understandings--this is the very type of activity that the fair use doctrine intends to protect for the enrichment of society.
We've wandered into uncharted legal territory with only a few light posts guiding the way. Nothing about this is settled or obvious.
I'm not sure it's "unseating". Moving from one lobe of enclosed corporate power to another doesn't help. I think this serves more to enclose that which was free, than to liberate what was closed.
>I think this serves more to enclose that which was free, than to liberate what was closed.
What has been closed? Every day there are new foundational models being released. It's exhausting trying to keep up with the pace of change - Alpaca, Vicuna, LLaMA, etc. I'd be shocked if there weren't a truly open-source foundational model with performance equivalent to OpenAI's GPT-4 by the end of the year.
Any move to limit fair use and the scraping of publicly available information through copyright law, no matter how good the reason, gives more power to the biggest companies.
It doesn’t have to be all or nothing. You can regulate the big companies and leave the smaller companies unregulated. I’m not sure what regulations, if any, are necessary here but pausing to think about it when the consequences are potentially world changing seems like a good idea.
If anything, it's the opposite. It's the legal restrictions around scraping that have entrenched the power of today's walled gardens.
They built their businesses on scraping; then, once they had their lock-in and monopolies, they turned around and fought against scraping to keep upstarts from eroding their business.
This blog is a great overview of the current landscape around scraping laws.
Fair use has been a thorn in the side of content and publishing giants for a long time as it allows for upstarts to create derivative works and databases without violating the original copyright.
If you block fair use and scraping, those that hold the most data and copyrights (the largest companies) will be the only ones able to create quality foundational models, thus holding the keys and further entrenching their power.
Yes, I agree that the inquiry itself is not a bad thing. However, what I think is a bad thing is if the inquiry ends in a ban (which I think it will) and denies the rest of humanity the obvious value of this new class of technology.
Except it was considered by most as a win-win situation. Google consumed it for the purpose of directing people to the original content (except for Google’s recent foray into info cards and quick answers within search). The info in the LLM doesn’t reference back to the source. There may be 1000 sources for a given concept embedded in the model anyhow.
Yes, and arguably Google has been way under-regulated for the externalities it caused to the system (one example is pushing companies into adversarial bidding wars over their own keywords, which is simply a rentier tax on Google advertising).
I think we're finally coming around on that (I think a lot of regulators were slow on antitrust in the internet space because the harm to the consumer/environment is not immediately obvious - i.e. it's mostly non-monetary externalities)
It's the same with America as a whole and offshoring to reduce labor costs. The externalities are slow, but they're clearly there in the increasing polarization and the growth of the precariat.
It’s not. Indexing and linking to primary sources is very different from consuming all of those primary sources and spitting them back without attribution. Google entered this area a bit with the info boxes it displays alongside results, but that’s a very different proposition from AI.
This is an interesting point: if someone develops a card catalogue for a library, they aren't providing a market substitute for each one of those books. As far as I know, fair use law has to account for whether the end product can serve as a substitute for the original work.
Fair use copyright law in the US allows for selling summaries of books, and those can serve as a substitute for the original work in many cases. Many students use Cliff Notes to get through their literature courses without ever doing the assigned reading.
I find any comparison between a human reading something and a model owned by a private company ingesting all the world’s information absolutely ridiculous.
This is not the first time. Google built a trillion-dollar business by taking content people created across the internet and monetizing it without the creators' prior consent. It was the largest information heist in the history of mankind.
They also tried to do the same with the book industry; fortunately, publishers had the money to pay for lawsuits against Google.
It's a more advanced Google search. They may not like the fact that it was built on publicly available data, but Pandora's box has been opened for the world, and European companies will become less competitive without this valuable tool in their toolboxes.
This logic applies to nearly everything. People invent things and then sell them, but the invention would not have been possible if they had not relied on information they often got for free.