Hacker News | xurukefi's comments

Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.


> I figured that they have found an (automated) way to imitate Googlebot really well.

If a site (or the WAF in front of it) knows what it's doing, then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
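For anyone curious, the dance is: reverse-resolve the client IP, check the resulting hostname's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal sketch in Python (error handling simplified; Google also publishes its crawler IP ranges as JSON if you'd rather match on those):

```python
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip: str) -> bool:
    """The verification dance: (1) a reverse (PTR) lookup of the client IP
    must yield a hostname under googlebot.com or google.com, and (2) a
    forward lookup of that hostname must map back to the same IP.
    A spoofed User-Agent fails step 1, because the spoofer does not
    control the PTR records for their IP range."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
    except OSError:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except OSError:
        return False
    return ip in forward_ips
```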


There are ways to work around this. I just tested it: I used the URL Inspection tool in Google Search Console to fetch a URL on my website which I had configured to redirect to a paywalled news article. It turns out the crawler follows that redirect and gives me the full source code of the redirected page, without any paywall.

That would maybe be a bit insane to automate at the scale of archive.today, but I figure they do something along those lines. It's a perfect imitation of Googlebot because it literally is Googlebot.


I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P

Presumably they are just matching on *Google* and calling it a day.
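That pitfall is easy to demonstrate with a hypothetical wildcard rule (user-agent strings approximate):

```python
from fnmatch import fnmatch

# A naive allow-rule like "*Google*" cannot tell Google's crawlers apart:
naive_rule = "*Google*"

real_googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
inspection_tool = "Mozilla/5.0 (compatible; Google-InspectionTool/1.0)"

# Both match, so both get served the paywall-free version of the page.
print(fnmatch(real_googlebot, naive_rule))    # True
print(fnmatch(inspection_tool, naive_rule))   # True
```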


Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.


> which I've configured to redirect to a paywalled news article.

Which specific site with a paywall?


> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

The curious part is that they allow scraping arbitrary pages on demand. So a publisher could put in a lot of requests to archive their own pages and check whether they all come from a single account or a small subset of accounts.

I hope they haven't been stealing cookies from actual users through a botnet or something.


Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.


You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.

It would be challenging to do with text, but is certainly doable with images - and articles contain those.


You need that sort of thing (i.e. watermarking) when people are intentionally trying to hide who did it.

In the archive.today case, it looks pretty automated. Surely just adding an HTML comment would be sufficient.


If they use paid accounts I would expect them to strip identifying info automatically. An "obvious" way to do that is to diff the output from two separate accounts on separate hardware connecting from separate regions. Streaming services commonly employ per-session randomized steganographic watermarks to thwart such tactics, so we should expect major publishers to do so as well.

At which point we still lack a satisfactory answer to the question. Just how is archive.today reliably bypassing paywalls on short notice? If it's via paid accounts you would expect they would burn accounts at an unsustainable rate.
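A minimal sketch of the two-account diff idea above (token-level for simplicity; a real pipeline would also have to handle markup and images):

```python
import difflib

def strip_account_marks(copy_a: str, copy_b: str, mask: str = "[redacted]") -> str:
    """Keep only what is identical in both copies; anything that differs
    between the two accounts (names, session ids, textual watermarks)
    gets masked. Token-level diff, so whole words are redacted."""
    ta, tb = copy_a.split(), copy_b.split()
    matcher = difflib.SequenceMatcher(a=ta, b=tb, autojunk=False)
    out = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        out.append(" ".join(ta[i1:i2]) if tag == "equal" else mask)
    return " ".join(out)

a = "Paywalled article body. Logged in as alice_01 since 2024."
b = "Paywalled article body. Logged in as bob_77 since 2024."
print(strip_account_marks(a, b))
# Paywalled article body. Logged in as [redacted] since 2024.
```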


Watch https://news.ycombinator.com/threads?id=1vuio0pswjnm7; they post AT-free recipes for many paywalls.


I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.


...but what about subscription-only, paywalled sources?


Many publishers offer "the first one's free".

For those that don't, I would guess archive.today is using malware to piggyback off of subscriptions.


> which is, of course, ridiculous.

Why? In the world of web scraping this is pretty common.


Because it works too reliably. Imagine what that would entail: managing thousands of accounts, and making sure to strip the account details from archived pages perfectly. Every time a website changes its code even slightly, you risk losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article; archive.today has managed to give me a clean, non-paywalled version every single time.

Maybe they use accounts for some special sites. But there is definitely some automated, generic magic happening that manages to bypass the paywalls of news outlets. Probably something Googlebot-related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.


Using two or more accounts could help you automatically strip account details.


That's actually a really neat idea.


Do you know where the doxxed info ultimately originates from? It turns out that the archives leaked account names. Try Googling what happened to volth on GitHub.


I could be wrong, but I think I've seen it fail on more obscure sites. But yeah it seems unlikely they're maintaining so many premium accounts. On the other hand they could simply be state-backed. Let's say there are 1000 likely paywalled sites, 20 accounts for each = 20k accounts, $10/month => $200k/month = $2.4m a year. If I were an intelligence agency I'd happily drop that plus costs to own half the archived content on the internet.

Surely it wouldn't be too hard to test. Just set up an unlisted dummy paywall site, archive it a few times and see what the requests look like.


Interesting theory. It would also be a good way to subtly undermine the viability of news outlets, not to mention the insidious potential of altering snapshots at will. OTOH, I'd expect a state-sponsored effort to be more professional in terms of not threatening and smearing some blogger who questioned them.


If I were an intelligence agency wanting to throw people off my scent, maybe I'd set up or pay off a blogger to track down my site's "owner" and then do some immature shit in response to absolutely confirm forever that the blogger was right.

Not saying this is true, just saying it could be.


Replace any identifiers like usernames and emails with another string automatically.
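A sketch of what that might look like; the patterns and replacement strings here are made up, and a real archiver would need site-specific rules:

```python
import re

# Hypothetical redaction rules, purely for illustration:
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "user@example.invalid"),  # e-mail addresses
    (re.compile(r"(Signed in as )\S+"), r"\1reader"),                  # account banners
]

def scrub(html: str) -> str:
    """Replace identifiers with fixed placeholder strings."""
    for pattern, replacement in PATTERNS:
        html = pattern.sub(replacement, html)
    return html

print(scrub("Signed in as alice42 (alice@mail.com)"))
# Signed in as reader (user@example.invalid)
```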


It's because it's actively maintained, and bypassing paywalls is its whole selling point, so they have to be good at it.

They bypass the rendering issues by "altering" the webpages. It's not uncommon to archive a page and see nothing because of the paywall, but then later on the same page is silently fixed. They have a Tumblr where you can ask them questions; at one point it was quite common for everyone to ask them to fix random specific pages, which they did promptly.

Honestly, you cannot archive a modern page, unless you alter it. Yet they're now being attacked under the pretence of "altering" webpages, but that's never been a secret, and it's technologically impossible to archive without altering.


There's a pretty massive difference between altering a snapshot to make it archivable/readable and doing it to smear and defame a blogger who wrote about you.


I imagine accounts are the only way that archive.today works on sites like 404media.co that seem to have server-side paywalls. Similarly, Twitter has a completely server-side paywall.


It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.


But it is reliable in the sense that if it works for a site, then it usually never fails.


No tool is 100% effective. Archive.today is the best one we've seen.


I hate NAT with a passion. It's a terrible technology, whose disruptive nature has probably prevented any novelty on the transport layer. But this article is oversimplifying things.

It is well known that NAT is not meant for security and that NAT is not a firewall. But one cannot deny that it implicitly brings some "default" security to the table. With NAT it's basically impossible to get screwed over, because there is no meaningful, practical way to allow inbound connections unless the client explicitly defines them (port forwarding). With IPv6, you could have a lazy vendor that does no firewalling, has a default-allow policy, or ships a buggy firewall. With NAT that is not possible. There is no lazy or buggy NAT implementation that allows inbound connections to your entire network, because it is technically impossible. When a NATting device receives a packet with a destination port that has not previously been opened by a client, it does not drop the packet because of a vendor's decision; it drops the packet because there is simply no other option, due to the nature of NAT. That is what people mean when they talk about the inherent "security" of NAT.

Again, NAT is terrible. We need to finally get rid of IPv4 globally, and all the NATting that comes with it. But let's keep to the facts.
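The "no other option" point can be shown with a toy model (all names and numbers made up): the translation table is created only by outbound traffic, so an unsolicited inbound packet simply has no internal destination to be forwarded to.

```python
class ToyNat:
    def __init__(self, public_ip: str):
        self.public_ip = public_ip
        self.table = {}        # public port -> (internal ip, internal port)
        self.next_port = 40000

    def outbound(self, src_ip: str, src_port: int):
        """A client opens a connection; the NAT allocates a public port."""
        pub_port = self.next_port
        self.next_port += 1
        self.table[pub_port] = (src_ip, src_port)
        return self.public_ip, pub_port

    def inbound(self, dst_port: int):
        """Deliverable only if a client created the mapping first; returning
        None here is not a policy decision, it is the only option."""
        return self.table.get(dst_port)

nat = ToyNat("203.0.113.1")
_, port = nat.outbound("192.168.1.10", 51515)
print(nat.inbound(port))   # ('192.168.1.10', 51515): reply traffic flows back
print(nat.inbound(22))     # None: unsolicited inbound, nowhere to send it
```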


> there is no meaningful practical way to allow inbound connections without the client explicitly defining them

This... just isn't true, though. Your router knows it has one network on one interface and another network on a second interface, and if it receives a packet on one interface destined for the network on the other, it will happily route it unless something (a firewall) tells it not to. All the protection comes from trusting your ISP and its peers not to route RFC 1918 private networks.


For me, type hints are mainly useful because they're the only reliable way to get decent IDE auto-completion. Beyond that, they feel like a bolted-on compromise that goes against the spirit of Python. If you really need strict typing, you're probably better off using a statically typed language.


JSDoc plays a similar role for JavaScript. Moreover, it is supported out of the box by VS Code, so add a few JSDoc comments to your types and functions and IntelliSense instantly kicks in.


The LaTeX community is astonishingly good at gatekeeping. I can't think of another field where the adoption of a clearly superior modern alternative has been so slow. For some reason, they seem to take pride in clinging to a 50-year-old typesetting system—with its bloated footprint, sluggish compilation, incomprehensible error messages, and a baroque syntax that nobody truly understands. People have simply learned just enough to make it work, and now they treat that fragile familiarity as a virtue.


The problem is that with LaTeX I end up in the same situation as in Word: I do not understand what is happening and why.

Typst was an amazing addition to the modern IT stack. I use it whenever I can. The only issue is that companies like Google and Microsoft dominate the collaboration space, and I have zero chance of convincing a company to adopt Typst for internal documents that need to look good. It would be great, though.


Nobody forces you to change your key for renewals.


removed


The linked source seems to be checking (len_in && !len), with len_in being the passed argument and len being the page-aligned length.
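If this is the usual PAGE_ALIGN overflow guard (an assumption on my part), the point is that rounding a huge length up to a page boundary can wrap to zero in unsigned arithmetic. A quick illustration, assuming 4 KiB pages and 64-bit wraparound:

```python
PAGE_SIZE = 4096                      # assuming 4 KiB pages
MASK64 = (1 << 64) - 1                # emulate 64-bit unsigned wraparound

def page_align(n: int) -> int:
    """Round n up to the next page boundary, wrapping at 2**64 the way
    unsigned arithmetic in C would."""
    return ((n + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)) & MASK64

len_in = (1 << 64) - 1                # a huge (bogus) length request
length = page_align(len_in)
print(length)                         # 0: nonzero len_in, aligned length of 0
```

So a check of the form (len_in && !len) rejects exactly these wrapped requests.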


> The question then becomes, whether Adblockers could use this information to skip the ads. It's a cat and mouse game.

I wouldn't call it a cat-and-mouse game, because there is nothing from a technical point of view that prevents ad blockers from using this information to skip ads. Unless YouTube completely gets rid of the concept of timestamps for its videos, it will always lose this battle.


They could make the timestamp conversion happen server-side, or create a very unwieldy API where it's difficult to collect all ad segments via timestamp sampling.


> For one thing, this approach seems to inherently conflict with the fact that you can link directly to a particular timestamp in a YouTube video, either in an external link using the `&t=...` URL parameter, or by just including a timestamp in a YouTube comment.

The more I think about this, the more I believe that this is literally the only reason that ad blocking cannot be meaningfully defeated for video on demand. Because of the concept of referencing a fixed point in the video by a timestamp, there will always need to be a mechanism to offset the timestamp with respect to the injected ads, which in turn gives ad blockers the ability to find out exactly where the ads are.
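A sketch of that argument with a hypothetical data model, where each spliced-in ad is a (start in spliced timeline, duration) pair. The server must expose enough of this mapping to honor ?t= links, and probing the mapping reveals the ads as discontinuities:

```python
ADS = [(30.0, 15.0), (120.0, 20.0)]   # hypothetical: (spliced-timeline start, duration) in seconds

def original_to_spliced(t: float) -> float:
    """Where the player must seek in the ad-spliced stream to show the
    clean video's timestamp t (the offset mechanism described above)."""
    offset = 0.0
    for start, dur in sorted(ADS):
        if start <= t + offset:
            offset += dur
    return t + offset

def visible_jumps(step: float = 1.0, horizon: float = 300.0):
    """What an ad blocker can do: probe the mapping and read the ads
    straight off its discontinuities, as (clean position, ad duration)."""
    jumps, t, prev = [], 0.0, original_to_spliced(0.0)
    while t < horizon:
        t += step
        cur = original_to_spliced(t)
        if cur - prev > step + 1e-9:      # mapping jumped: an ad was crossed
            jumps.append((t, cur - prev - step))
        prev = cur
    return jumps

print(visible_jumps())   # [(30.0, 15.0), (105.0, 20.0)]
```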


> What does "server side injection" actually mean?

The way ads usually work is that they are separate video files that are fetched by the YouTube client (e.g., the browser) and then displayed to the user. Ad blockers modify the content of the website so that the URLs to those ads (usually embedded in some JSON object from an API endpoint or something like that) get removed and no ads are displayed.

Server-side injection in this context means that the server renders a video file on demand that contains the original clean video plus a few ads here and there. Blocking the ads is now much harder, because you cannot simply manipulate API responses containing references to those ads; there is only this one video file. Instead, you would need a mechanism that skips those ads in the player.

AFAIK server-side injection is already done on Twitch for live streams, where blocking ads is basically impossible because you cannot skip anything in a live stream. I think the best way to get rid of ads on Twitch is to use a VPN/proxy in a country where no ads are delivered for contractual reasons.
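A sketch of what such player-side skipping could look like, assuming the ad intervals in the spliced stream are already known (the interval values and the callback name are made up for illustration):

```python
AD_INTERVALS = [(30.0, 45.0), (120.0, 140.0)]   # hypothetical (start, end) in spliced time

def on_timeupdate(position: float) -> float:
    """Called periodically by a hypothetical player: if playback has
    entered a known ad interval, seek to its end; otherwise do nothing."""
    for start, end in AD_INTERVALS:
        if start <= position < end:
            return end           # jump past the ad
    return position

print(on_timeupdate(31.0))   # 45.0: inside an ad, seek to its end
print(on_timeupdate(10.0))   # 10.0: normal content, leave playback alone
```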


It's a nice idea, but I don't think it adds enough clarity to the code to justify the messy compiler warnings and errors that this kind of preprocessor abuse will eventually cause.

