
So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money. The only real loser is the common man who doesn't have the resources to scrape the entire web himself.

I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
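As a rough illustration of what that buys you, here's a minimal sketch in Python; the mirror URLs are purely hypothetical. Because the identifier is the hash of the content rather than its location, any third party, including an archive, can serve the same bytes, and the client verifies them instead of trusting the host:

```python
import hashlib
import urllib.request

def content_id(data: bytes) -> str:
    # The resource's identity is its hash, not the server it came from.
    return hashlib.sha256(data).hexdigest()

def fetch_by_hash(expected: str, mirrors: list[str]) -> bytes:
    # Try any host that claims to have the content; trust the hash, not the host.
    for url in mirrors:
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue  # mirror down or hug-of-deathed; try the next one
        if content_id(data) == expected:
            return data
    raise LookupError("no mirror returned content matching the hash")

# Hypothetical mirrors all serving the same immutable blob; the origin going
# down doesn't matter as long as someone, anyone, still has a copy.
mirrors = [
    "https://origin.example/blob",
    "https://archive.example/blob",
    "https://random-volunteer.example/blob",
]
# page = fetch_by_hash(published_hash, mirrors)
```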


The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

This is from my experience having a personal website. AI companies keep coming back even if everything is the same.


Weird. Considering IA has most of its content in a form you could rehost, I don't know why nobody's just hosting an IA carbon copy that AI companies can hit endlessly, cutting IA a nice little check in the process. But I guess some of the wealthiest AI startups are very frugal about training data?

This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.


Unless regulated, there is no incentive for the giants to fund anything.

There is no problem that cannot be solved by creating a bureaucracy and paperwork!

I understand this is tongue-in-cheek, but do you have an alternative/better proposal?

Let the market work. If good data is so critical to the success of AI, AI companies will pay for it. I don't know how someone can still entertain the idea that a bureaucrat, or worse, a politician, is remotely competent at designing an efficient economy.

All the world's data was critical to the success of AI. They stole it and fought the system to pay nothing, then settled for peanuts because the original creators were in too weak a position to negotiate. It already happened.

No they won't pay for it, unless they believe it's in their best interests. If they believe they can free-ride and get good data without having to pay for it, why would they lay down a dollar?

Because the companies in control of that data won't let them have it for free, like what is happening in the article.

Or, they'll just create more technically sophisticated workarounds to get what they want while avoiding a bad precedent that might cost them more money in the long run. Millions for defense, not one cent for tribute.

Now apply the same logic to laws, except that laws are a lot slower to change when they find the next workaround.

And it's a lot harder to get the law to stop doing something once it proves to cause significant collateral damage, or just cumulative incremental collateral damage while having negligible effectiveness.


That already exists, it's called Common Crawl[1], and it's a huge reason why none of this happened prior to LLMs coming on the scene, back when people were crawling data for specialized search engines or academic research purposes.

The problem is that AI companies have decided that they want instant access to all data on Earth the moment it becomes available somewhere, and they have the infrastructure behind them to actually try to make that happen. So, unlike even the most aggressive search engine crawlers, they're ignoring signals like robots.txt and not even checking whether the data is actually useful to them (they're not getting anything helpful out of recrawling the same search results pagination in every possible permutation, but that won't stop them from trying, and from knocking everyone's web servers offline in the process). They're just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.

This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.

It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course the big firms started freaking out; their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!

The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things aren't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.

Their approach to crawling is just a microcosm of the whole industry right now.

[1]: https://en.wikipedia.org/wiki/Common_Crawl

[2]: https://fxgn.dev/blog/anubis/ and related HN discussion https://news.ycombinator.com/item?id=45787775


Thanks for the mention of Common Crawl. We do respect robots.txt and we publish an opt-out list, due to the large number of publishers asking to opt out recently.

There's a bit of discussion of Common Crawl in Jeff Jarvis's testimony before Congress: https://www.youtube.com/watch?v=tX26ijBQs2k


So perhaps the AI companies will go bankrupt and then this madness will stop. But it would be nice if no government intervenes because they are "too big to fail".

Are you sure it's the AI companies being that incompetent, and not wannabe AI companies?

What I feel is a lot more likely is that OpenAI et al are running a pretty tight ship, whereas all the other "we will scrape the entire internet and then sell it to AI companies for a profit" businesses are not.


They run a tight AI ship but it is in their interest to destroy the web so that people can only get to data through their language model

OpenAI cannot possibly be running a tight ship, even if they have competent scientists and engineers.

yeah, they should really have a think about how their behavior is harming their future prospects here.

Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.

We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.


Something I’ve noticed about technology companies, and it’s bled into just about every facet of the US these days, is that they only consider whether an action *can* be executed, never whether it *should* be.

It’s very unfortunate and a short-sighted way to operate.


The main issue is that a well-behaved AI company won't be singled out for continued access; they will all be hit by public sites blocking AI access. So there is no benefit to them behaving.

> So there is no benefit to them behaving.

That's assuming they're deriving a benefit from misbehaving.

There is no benefit to immediately re-crawling 404s or following dynamic links into a rabbit hole of machine-generated junk data and empty search results pages in violation of robots.txt. They're wasting the site's bandwidth and their own in order to get trash they don't even want.

Meanwhile there is an obvious benefit to behaving: You don't, all by yourself, cause public sites to block everyone including you.

The problem here isn't malice, it's incompetence.


Why should a well-behaved AI company be singled out for continued access? If the industry can't regulate itself then none deserve access no matter if they're "well-behaved".

Receiving a response from someone's webserver is a privilege, not a right.


Honestly, have any of these AI companies ever offered compensation for the data they pillage, except in the case of large walled-up information silos like Reddit? This is like asking why the occasional burglars are not singled out for direct access into your house, compared to the strip-mining marauders out there.

Why does any of them deserve any special treatment? Please don't try to normalize this reprehensible behavior. It's a greedy, exploitative and lawless behavior, no matter how much they downplay it or how long they've been doing it.


No single piece of content (unless you're a really large website) is worth the paper that such a contract would be written on.

This is the problem with AI scraping. On one hand, they need a lot of content, on the other, no single piece of content is worth much by itself. If they were to pay every single website author, they'd spend far more on overhead than they would on the actual payments.

Radio faces a similar problem (it would be impossible to hunt down every artist and negotiate licensing deals for every single song you're trying to play). This is why you have collective rights management organizations, which are even permitted by law to manage your rights without your consent in some countries.


This is just tragedy of the commons.

It’s insane, actually, how fast they re-request the same pages, even 404s. They’re so desperate for data they’re really hurting smaller hosts. One of our clients' sites became unusable when one of the AI bots started spamming the WordPress search for terms that I’m guessing users were searching for but were unrelated to the site's content. Instead of building a search index they’re just hammering sites directly. So annoying.

It can be 10,000 requests a day on static HTML and non-existent PHP pages. That's on my site. I'd rather them have Christ-centered and helpful content in their pretraining. So, I still let them scrape it for the public good.

It helps to not have images, etc., that would drive up bandwidth cost. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block them or restrict it to specific pages once per day. Maybe just block the heavy content, like the images.

Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?
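In case it's useful to anyone, here's a minimal sketch of that kind of selective blocking as WSGI middleware; the user-agent substrings and file extensions are illustrative assumptions, not a vetted list:

```python
# Sketch: return 403 for image/media requests from AI crawlers, so bots still
# get the cheap HTML but can't run up the bandwidth bill on heavy assets.
BLOCKED_UA_HINTS = ("GPTBot", "ClaudeBot", "Bytespider")  # illustrative only
HEAVY_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".mp4", ".pdf")

def block_heavy_assets(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "").lower()
        if path.endswith(HEAVY_SUFFIXES) and any(hint in ua for hint in BLOCKED_UA_HINTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Heavy assets are not served to crawlers.\n"]
        return app(environ, start_response)
    return middleware
```

Wrapping an existing WSGI app with block_heavy_assets applies the check to every request: crawlers still see the text, but the heavy files stop eating bandwidth.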


> The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

Maybe they vibecoded the crawlers. I wish I were joking.


Isn't this just how crawlers work? How do you know if a page has changed if you don't keep visiting it?

HEAD requests

Scrapers could also use the "Cache-Control", "Expires" and "If-Modified-Since" headers properly, to reduce their traffic a little, but do they?
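Mostly they don't, even though the polite version is only a few lines. A stdlib-only sketch (assuming the caller keeps the validators from the previous visit):

```python
import urllib.error
import urllib.request

def recrawl(url: str, last_modified: str | None, etag: str | None):
    # Ask the server to send the body only if it actually changed since last time.
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        resp = urllib.request.urlopen(req, timeout=10)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None, last_modified, etag  # unchanged: no body transferred
        raise
    body = resp.read()
    # Remember the validators so the next visit can be a cheap conditional request.
    return body, resp.headers.get("Last-Modified"), resp.headers.get("ETag")
```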

> The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?


Possibly because a lot of “AI-company scraping” isn't traditional scraping (e.g., to build a dataset of the state at a particular point in time); it's referencing the current content of the page as grounding for the response to a user request.


Coincidentally, most of the funding towards IPFS development dried up because the VC money moved on to the very technology enabling these problems...

Is there a good post-mortem of IPFS out there?

What do you mean? It is alive and "well". Just extremely slow now that interest waned.

It's been several years, but in my experiments it felt plenty fast if I prefetched links at page load time so that they're already local by the time the user actually tries to follow them (sometimes I'd do this out to two hops).

I think it "failed" because people expected it to be a replacement transport layer for the existing web, minus all of the problems the existing web had, and what they got was a radically different kind of web that would have to be built more or less from scratch.

I always figured it was a matter of the existing web getting bad enough, and then we'd see adoption improve. Maybe that time is near.


Oh, I meant slow in terms of adoption and public interest. My bad, I expressed that awfully.

But you are right on the reason it "failed". People expected web++, with a "killer app", whatever that means. Imagination is dead.


I'm still working on what I think could be a killer app for it, but progress happens on holidays and vacations and weekends only if I'm lucky, so as you say... it's slow :)

The primary issue I see with IPFS is that a significant majority of all web users are on mobile. They can't act as content hosts or routers. In P2P parlance they can only ever act as leeches. Even among people with full-fledged computers, the market is dominated by laptops. These have similar availability issues as phones, even if they don't have the same storage or connectivity limitations.

Compared to the total number of users on the Internet, relatively few have stable, always-on machines ready to host P2P content. ISPs do not make it easy, or at times even possible, to poke holes in firewalls to allow for easy hosting on residential connections. This necessitates hole punching, which adds non-trivial delays to connections and results in overall poorer network performance.

It's less that imagination is dead and more that the limitations of the modern Internet retard the momentum of P2P anything.


> The primary issue I see with IPFS is that a significant majority of all web users are on mobile. They can't act as content hosts or routers.

Is there any reason this has to be true? Probably some majority or significant minority of mobile devices spend some eight hours a day attached to a charger in a place where they have the WiFi password, while the user is asleep. And you don't need 100% of devices to be hosts or routers, 10% at any given time would be more than sufficient.


> And you don't need 100% of devices to be hosts or routers, 10% at any given time would be more than sufficient.

Except it doesn't. Routes and content take hours to converge.


Is convergence necessary?

If a peer says "hey there's a new version of this" and that peer also has pinned that version, then I can get it from them right now, well before the network converges. Yeah maybe it'll take a few hours for the other side of the planet to get the word, but for most data a couple hours or a couple days is fine. Tolerating latencies was kind of the point of calling it "interplanetary".

What's the use case where I'm on the other side of the planet and I somehow end up with a CID which I can't resolve? How did I get that CID so much faster than content to which it refers?


Why?

Why not? If internet access goes away there's no reason the data on my phone can't be made available to other phones on the same LAN.

The tricky part is the trust networking that incentivizes me to allow those others to do so.


What's IPFS's killer app?

They already are. I've been dealing with Vietnamese and Korean residential proxies destroying my systems for weeks, and I'm growing tired. I cannot survive 3,500 RPS 24/7.

> I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.

You've just described Nostr: Content that is tied to a hash (so its origin and authenticity can be verified) that is hosted by third parties (or yourself if you want)


With Nostr you can host your content anywhere, but for it to actually be discoverable, you need to declare that host. Third parties therefore cannot really solve the problem for you, without your help.

It would be nice if IA could create a browser extension or TLS-intercepting proxy that end users can run on their own computers and connections, allowing crowd-sourced scraping. It would need an allow/deny-listing feature for sites to passively crawl, and I'm not sure how you could prevent data poisoning, but it would at least get around the issues of blocking.

I don’t believe resips will be with us for long, at least not to the extent they are now. There is pressure and there are strong commercial interests against the whole thing. I think the problem will solve itself in some part.

Also, I always wonder about Common Crawl:

Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, so that they need to crawl our sites over and over again for the exact same stuff, each on its own?


Many AI projects in academia or research get all of their web data from Common Crawl -- in addition to many non-AI usages of our dataset.

The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
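For anyone wondering what that looks like in practice, here's a minimal stdlib-only sketch; the user agent string and the fallback delay are illustrative assumptions, not anyone's official crawler:

```python
import time
import urllib.request
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleResearchBot/0.1"  # hypothetical crawler name

def polite_fetch(site: str, paths: list[str]) -> dict[str, bytes]:
    # Read the site's robots.txt once and honor both its allow rules and crawl delay.
    rp = RobotFileParser()
    rp.set_url(urljoin(site, "/robots.txt"))
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 5  # fall back to a conservative pause

    pages = {}
    for path in paths:
        url = urljoin(site, path)
        if not rp.can_fetch(USER_AGENT, url):
            continue  # the site asked us not to; skip it
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=10) as resp:
            pages[url] = resp.read()
        time.sleep(delay)  # rate limit between requests
    return pages
```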


Thank you!

> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.

But how can they aspire to do any of that if they cannot build a basic bot?

My case, which I know is the same for many people:

My content is updated infrequently. Common Crawl must have all of it. I do not block Common Crawl, and I see it (the genuine one from the published ranges; not the fakes) visiting frequently. Yet the LLM bots hit the same URLs all the time, multiple times a day.

I plan to start blocking more of them, even the User and Search variants. The situation is becoming absurd.


Well, yes, it is a bit distressing that ill-behaved crawlers are causing a lot of damage -- and collateral damage, too, when well-behaved bots get blocked.

> I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent.

I wrote a short paper on that 25 years ago, but it went nowhere. I still think it is a great idea!


Blocking the Internet Archive sounds like non-tech leadership making decisions without understanding how easy it is to simply get the content another way, which makes the block moot.

Kind of sucks, because the news is an important part of that kind of archive.


>The only real loser is the common man who doesn't have the resources to scrape the entire web himself.

definitely, this is going to hurt those over at /r/datahoarder


Even if the site is archived on IA, AI companies will still do the same.

AI companies are _already_ funding and using residential proxies. Guess how much of those proxies are acquired through being compromised or tricking people into installing apps?

Does anyone know if Teslas do this? I noticed Tesla cars want to have access to local WiFi and eat up oodles of bandwidth …

AI browsers will be the scrapers, shipping content back to the mothership for processing and storage as users co browse with the agentic browser.

> So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money.

News websites aren’t like those labyrinthian cgit hosted websites that get crushed under scrapers. If 1,000 different AI scrapers hit a news website every hour it wouldn’t even make a blip on the traffic logs.

Also, AI companies are already scraping these websites directly in their own architecture. It’s how they try to stay relevant and fresh.


Hello hi, I work on a news site and we absolutely notice and it does mess up traffic logs.

But don't you have to sign a license agreement that prohibits scraping in order to purchase a subscription that allows you to bypass the paywall?

It's almost as if this isn't about scraping and more about shutting down a "free article sharing" channel that gets abused all the time.

But hey, paywalled sites might be getting 2-3 additional subscriptions out of it!

We don’t lack the technology to limit scrapers; sure, it’s an arms race with AI companies that have more money than most. Why can’t this be a legal block through TOS?

IIRC it's not quite true. 75% of the book is more likely to appear than you would expect by chance if prompted with the prior tokens. This suggests that it has the book encoded in its weights, but you can't actually recover it by saying "recite harry potter for me".

Do you happen to know, is that because it can’t recite Harry Potter, or because it’s been instructed not to recite Harry Potter?

It's a matter of token likelihood... as a continuation, the rest of chapter one is highly likely to follow the first paragraph.

The full text of Chapter One is not the only/likeliest possible response to "recite chapter one of harry potter for me"


Instructed not to was my understanding.

Not sure how they're being counted, but that adds up to 46 with the pair spells counted separately. But then nox is counted twice, so maybe 45.

A mile is exactly 1000 paces, or 4000 feet. You may disagree, but consider: the word mile comes from the Latin for "one thousand". Therefore a mile must be 1000 of something, namely paces. I hope you find this argument convincing.



There is a "metric mile" which is 1500 m. This is something in the context of track and field athletics.


There is also a "metric gallon", or 4L.


Same with a Metric ton (a "tonne") which is one thousand kilograms (pretty close to an imperial ton).


Not to be confused with the 1600 meter or "1 mile" race which is commonly run in US track and field events (i.e. 4 times around a 400 meter track). At least that's within 1% of an actual mile.


Maybe after society has collapsed and been rebuilt we'll end up with km and cm having a weird ratio to the meter. Same for kg. At least celsius is just about impossible to screw up.


Celsius is already screwed up because it's not zero-based.


Celsius is eminently practical (similar to power-of-two prefixes). Absolute zero is completely irrelevant for day-to-day human life.


Fahrenheit is even more practical for daily life, as it was designed to be.


How so? 0 = freezing water, 100 = boiling water is at least more useful than more or less arbitrary points of F.

Celsius is just as arbitrary as Fahrenheit, so I wouldn't be so sure.


Webster’s dictionary defines a mile as …


I think your comment is supposed to be sarcastic, but I'm not sure what the sarcasm is about? Yes, a mile is 1000 paces. That is why it's called a mile. It's not an "argument"; it's just what a mile is.


> Did you notice how they have banned and demonized tobacco, but the lung cancer rate keeps increasing?

No, I noticed the opposite. They demonized tobacco, and lung cancer rates went down precipitously.


[dead]


Rates are down massively compared to where they were before the significant drop in smoking.

https://pmc.ncbi.nlm.nih.gov/articles/PMC10752493/


[flagged]


> a full 1/3 of lifelong smokers never develop any kind of cancer,

That's "true" in the sense that it's the CVD (Cardiovascular disease) and COPD (Chronic obstructive pulmonary disease) that are way more likely to take them out first.

Lifetime Smoking History and Cause-Specific Mortality in a Cohort Study with 43 Years of Follow-Up

https://pmc.ncbi.nlm.nih.gov/articles/PMC4824471/

Sure, you absolutely can be 98 years old sucking back on a deathstick, just like you might find yourself screaming "suck it" as you take home that giant lottery cheque with some winnings.

Pachinko's a hell of a game .. but still the house wins.


[dead]


Buck up, This Is Serious Mum:

40 Years of Living, Then Death (with simulated smoking) https://www.youtube.com/watch?v=oGxDVXGRQpY

Life is a MLM: Death, Death, Death https://www.youtube.com/watch?v=ZxoODPQ4CTM


There has to be a clause for "willful disregard for the truth", no? Having your lying machine come up with plausible lies for you and publishing them without verification is no better than coming up with the lies yourself. What really protects them from fraud accusations is that these blog posts were just content marketing; they weren't making money off of them directly.


Even in civil law, where the bar for evidence is lower, it's hard to make the case that someone who posted wrong details on a free blog and didn't make money off of it should cover the damages you incurred by traveling based on that advice alone. Not making any reasonable effort to fact-check cuts both ways.

This is a matter of contract law between the two companies, but the people who randomly read an internet blog, took everything at face value, and, more importantly, didn't use that travel agency's services can't really claim fraud.

Just being wrong or making mistakes isn't fraud. Otherwise 99% of people saying something on the internet would be on the hook for damages again and again.


And using autocomplete to write travel advertisements has to fall under this category?


The businesses they acquire are ones whose revenue has not appreciably grown in many years. They are being sold because the prior owner does not believe they can improve the business any more.

Any profit Bending Spoons earns, they can run off and invest in another business if they like. They don't bother investing in the businesses they purchase because they believe, like the previous owner believed, that there is no more juice to squeeze from that particular lemon.



>Any profit Bending Spoons earns, they can run off and invest in another business if they like

And the ones who helped make Vimeo what it is? Left out in the cold to fend for themselves.

This is why loyalty is dead. Maybe if this billion-dollar acquisition benefited the workers there'd be fewer hard feelings, but that's not how capitalism works.


And the ones who helped make Vimeo what it is? Left out in the cold to fend for themselves.

This is such a bizarre mentality to me. When you sell your car, do you send a cut of the money to your mechanic?


Just don't look up what the word "hentai" means ;)


At least hentai isn't necessarily lolisho (although a lot of it is...)


.18 is 3% of 6. This might mean something, but I don't know what.


10 months out of six years is 0.14 so it isn't quite prenatal benefits.

What happens if an unborn baby has rights to go to preschool, but the birthing parent can't?

Is an unborn child a US citizen yet?


the next number in the sequence 3, 6, 18 is 72, but I doubt it means anything.


People pay to go to basketball games, so yes, obviously.

I don't think his point is that you shouldn't have a production crew. It's more that he's making two points:

- you shouldn't change perspectives very often, because that's jarring

- having more streams is preferable to having high production value, so if it costs too much you should just cut the production team.

