You call it extortion of the AI companies, but isn’t stealing/crawling/hammering...

cpncrunch · 2025-12-05T18:52:00 1764960720

>You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious?

You can easily block ChatGPT and most other AI scrapers if you want:

https://habeasdata.neocities.org/ai-bots
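
As a rough sketch of what that kind of block amounts to: the Python standard library's `urllib.robotparser` can show how a crawler that chooses to comply would read such a file. The robots.txt text, the `example.com` URL, and the `may_fetch` helper are made up for illustration. `GPTBot` and `CCBot` are published crawler tokens, but check each vendor's documentation for the current names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Parse the rules from a string instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str = "https://example.com/page") -> bool:
    """Return what a well-behaved crawler would conclude from the rules."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, may_fetch(agent))
```

Note that the check runs on the crawler's side: nothing here stops a client that never calls it, or one that announces a different user agent.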

james2doyle · 2025-12-05T19:16:18 1764962178

This is just using robots.txt and asking "pretty please, don’t scrape me".

Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...

If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.

Aeolun · 2025-12-06T01:58:47 1764986327

If someone has a robots.txt, and I want to request their page, but I want to do that in an automated way, should I open the browser to do it instead of issuing a curl request? How about if I am going to ask Claude to fetch the page for me?

kentm · 2025-12-06T03:26:17 1764991577

Respect the robots.txt and don’t do it?

cpncrunch · 2025-12-06T00:39:51 1764981591

Yes, I was referring to legitimate companies, and Perplexity doesn't seem to be one of those.

albedoa · 2025-12-06T05:02:45 1764997365

Oh for sure. When he wrote of the AI companies that are "stealing/crawling/hammering", you thought he meant the legitimate ones that do honor robots.txt. That makes sense.

cpncrunch · 2025-12-06T05:44:44 1764999884

Actually, it looks like all the major ones do honour robots.txt, including Perplexity. They seemingly get around it using Google SERPs, so they're not actually crawling or hammering the site servers (or even Cloudflare).

https://www.ailawandpolicy.com/2025/10/anti-circumvention-re...

jacobgkau · 2025-12-05T20:01:14 1764964874

I'm guessing you don't manage any production web servers?

robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin America and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.

cpncrunch · 2025-12-06T00:27:22 1764980842

I have been managing production commercial web servers for 28 years.

Yes, there are various bots, and some of the large US companies such as Perplexity do indeed seem to be ignoring robots.txt.

Is that a problem? It's certainly not a problem with CPU or network bandwidth (it's very minimal). Yes, it may be an issue if you are concerned with scraping (which I'm not).

Cloudflare's "solution" is a much bigger problem that affects me multiple times daily (as a user of sites that use it), and those sites don't seem to need protection against scraping.

filleduchaos · 2025-12-06T00:48:50 1764982130

It is rather disingenuous to backpedal from "you can easily block them" to "is that a problem? who even cares" when someone points out that you cannot in fact easily block them.

cpncrunch · 2025-12-06T01:07:59 1764983279

I was referring to legitimate ones, which you can easily block. Obviously there are scammy ones as well, and yes, it is an issue, but for most sites I would say the Cloudflare cure is worse than the problem it's trying to solve.

oasisbob · 2025-12-06T16:34:31 1765038871

"No true Scotsman needs Cloudflare, as any true Scotsman can block AI bots themselves" is not a strong argument.

kvirani · 2025-12-06T01:03:38 1764983018

Security almost always brings inconvenience (to everyone involved, including end users). That is part of its cost.

cpncrunch · 2025-12-06T01:22:25 1764984145

What security issue is actually being solved here though?

chrneu · 2025-12-05T22:52:47 1764975167

this is the equivalent of asking people not to speed on your street.

mplewis · 2025-12-05T22:43:04 1764974584

No you cannot! I blocked all of the user agents on a community wiki I run, and the traffic came back hours later masquerading as Firefox and Chrome. They just fucking lie to you and continue vacuuming your CPU.
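
To make the failure mode concrete, here is a minimal sketch of user-agent blocking, assuming a simple substring blocklist. The token list, the `is_blocked` helper, and the sample header strings are illustrative, not what the wiki actually ran.

```python
# Illustrative blocklist of crawler tokens (lowercased); not exhaustive.
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "perplexitybot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A crawler that identifies itself honestly (sample string, for illustration).
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# The same crawler claiming to be desktop Firefox (sample string).
spoofed = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

if __name__ == "__main__":
    print(is_blocked(honest))
    print(is_blocked(spoofed))
```

The filter only sees what the client chooses to send, so a scraper that copies a browser's header string passes straight through.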

cpncrunch · 2025-12-06T00:38:34 1764981514

There shouldn't be any noticeable hit on your CPU from bots on a site like that. Are you sure it's not a DDoS?

Obviously it depends on the bot, and you can't block the scammy ones. I was really just referring to the major legitimate companies (which might not include Perplexity).

literalAardvark · 2025-12-06T00:50:28 1764982228

There is a noticeable hit, there's also a noticeable cost, and it's not a DDoS.

Not all sites can have full caching, we've tried.

cpncrunch · 2025-12-06T01:11:49 1764983509

I was referring to the community wiki.

Sohcahtoa82 · 2025-12-05T23:44:24 1764978264

How are you this naive? Do you really think scrapers give a damn about your robots.txt?

cpncrunch · 2025-12-06T00:39:14 1764981554

The legitimate ones do, which is what I was referring to. Obviously there are bastard ones as well.

literalAardvark · 2025-12-05T23:35:36 1764977736

Tell me you don't run a site without telling me you don't run a site

cpncrunch · 2025-12-06T00:35:24 1764981324

Tell me you make incorrect assumptions without specifically saying so. (Yes, you're incorrect).