> New approaches are needed, like more dynamic approaches using behavioural analysis
Does this set off alarm bells for anyone else? Of course the best way to know if a visitor is a human or a bot is to deeply analyze their behavior. But that's at odds with the right of us humans not to be analyzed by every website we visit. What happens if we do reach a standoff where the bots become good enough at mimicking human behavior that the only way to tell us apart is unacceptable and illegal behavioral analysis?
I recently wondered whether the good reviews on WWW shopping sites are actually written by the bots. The market for astroturfing is so competitive that the paid reviewers probably learned a long time ago that you need to leave quality reviews to get repeat customers.
They also 'care' more than actual customers in many cases. Real customer -> "Stop sending me review reminders. It was a comb. Block." Bot -> "Dutifully review all kinds of products. 500 words on the life changing experience of hair brushing with this comb. A+ reviewer."
I find it difficult to believe that the bot networks would not have immediately rolled every single generative AI advance into their operations (writing convincing reviews, generating convincing product examples without buying, beating captchas more reliably, automated screen clicking, human eye scan impersonation). You need to be better than every other group doing paid reviews, and better than actual humans, who might write critical reviews.
Also, a lot of sites are already doing some behavioral analysis. Had a popup appear every time you consider clicking 'leave' on websites lately? "Before you go..."
> I recently wondered whether the good reviews on WWW shopping sites are actually written by the bots.
Probably, but if these are just like, Amazon reviews/etc, they likely violate FTC regulations. Enforcement is lacking, but I'd still be very hesitant to break the law.
Maybe the ultimate solution is to make people pay. As in microtransactions.
Human or bot is not really the problem; spam is the problem, and bots make spam so cheap that admins can't deal with it. So, bots are banned. Human spammers can still get in, and you can pay people to solve captchas, but humans are more expensive, so there are fewer of them and moderators can deal with them.
If we had people (or bots) pay a few cents to access a service, it could be enough to keep spam to a manageable level.
The problem is, people don't like to pay, and unlike with phone numbers, the web doesn't have a good microtransaction architecture, so behavioral analysis it is.
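Sketching what the server side might look like (hypothetical; the token names and verifier are invented, and the missing piece is exactly that there is no real payment network behind the check):

    from flask import Flask, abort, request

    app = Flask(__name__)

    def redeem_token(token):
        # Hypothetical: redeem a few cents against a micropayment
        # network that doesn't actually exist today. Returns True if
        # the token was valid and unspent.
        return token is not None and token.startswith("paid:")

    @app.before_request
    def require_toll():
        # 402 "Payment Required" has been reserved in HTTP since 1.1,
        # but no standard payment flow ever materialized behind it.
        if not redeem_token(request.headers.get("X-Payment-Token")):
            abort(402)

    @app.route("/")
    def index():
        return "Thanks for the two cents."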
Any payment system will be used to track and unmask users. Then they'll double dip selling your even better identified user data to whoever wants it. Probably while still showing you ads.
You can slow down bots with proof of work. A crypto-miner seems like the only possible payment method that would resist tracking; I think Brave tried something similar. Not sure I like that idea!
I think “slowing down the bots with proof of work” is essentially what captchas were, back when they were easy for us and hard for computers.
But now, when I see a captcha, I hit back, press unsubscribe, and find a new vendor. The work is harder for me than it is for a computer and so I won’t do it.
Accordingly, we can see that proof of work is the opposite of a solution.
I’ve been rereading your post here and I think I’m coming around. Captcha is not proof that I did the work; it’s an inference that because I was able to do the work, it must have been easy for me (because I’m a human).
I let my cryptoskepticism run a little too freely here. Thanks for making me think harder.
If the point is to stop spam, it doesn't need to be implemented like that. One way to do it could be: pay $5 to create an account. Your money will be returned to you as you post on the site. If someone determines that you're spamming, any money that hasn't been returned to you is lost.
The idea is not to make people pay 5 cents to see a website. Instead, replace captchas with a toll of about what it would cost to have a human solve the captcha for you.
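Mechanically that's just an escrow ledger. A toy sketch (the $5 deposit and the refund-per-post rate here are invented numbers):

    class Deposit:
        """Toy escrow: refund the signup deposit as the user posts,
        forfeit whatever is left if they're flagged as a spammer."""

        def __init__(self, cents=500, refund_per_post=25):
            self.held = cents
            self.refund_per_post = refund_per_post

        def on_post(self):
            refund = min(self.refund_per_post, self.held)
            self.held -= refund
            return refund  # credited back to the user

        def on_spam_flag(self):
            forfeited = self.held
            self.held = 0
            return forfeited  # kept by the site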
There are already 'solve captcha as a service' sites all over, with highly developed APIs for their use. Lots of people use them for sneaker bots and ticket bots etc.
How do you combine microtransactions with the need to be indexed by search engines?
Microtransactions solve the issue of bad bots, and possibly website monetization. But then do you want to give a free pass to search engine crawlers? The big ones will be strong enough to refuse to crawl your site if you don't. The small ones will be financially unable to crawl if you don't. If you allow them all, you're back to step 1. If you allow only one or a few, you basically freeze search engine innovation.
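For what it's worth, the usual way sites give crawlers a pass today is forward-confirmed reverse DNS rather than trusting the user agent. A sketch for Googlebot (note that the hardcoded allowlist is exactly the innovation-freezing part):

    import socket

    # Who gets the free pass -- and this hardcoded list is the problem.
    ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")

    def is_verified_crawler(ip):
        """Forward-confirmed reverse DNS: the IP's PTR record must be
        in an allowed domain AND resolve back to the same IP, so a bot
        can't just claim to be Googlebot in its user agent."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)
            if not host.endswith(ALLOWED_SUFFIXES):
                return False
            return socket.gethostbyname(host) == ip
        except (socket.herror, socket.gaierror):
            return False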
Spammers will have more disposable funds than me and more utility from making payments to spread their message than I will. Essentially this is more likely to exclude poor people while waving spammers on in through the ticket gate.
Not to mention credit card fees making sub-$1 payments a no-go, and crypto being its own barrel of nightmares.
There are the recently announced hardware root of trust solutions[1] that aren’t behavior driven.
But they just trade one privacy issue for potentially another, depending on your view
It does seem sadly unavoidable. Perhaps the internet has to go full circle and we need real identities if we want to ensure we’re not talking to machines?
It is not only a privacy issue; "certified by entrenched gatekeeper mega-corporation" is a nightmare for user freedom. The first ones who will be impacted are the minority who root their devices and compile their own software, but in the long run there are detrimental effects of a monopoly at the gate, like the ease of implementing surveillance, censorship, and DRM, that will apply to everyone.
Yeah, at second glance it actually (as proposed - huge caveat) might even be better for privacy.
I can’t begin to theorize how the future will play out if you need a PAT to access most web destinations. Do Cloudflare or Apple engineers ever use Linux machines? Surely they do, and either know this is bad or have some plan to make it work?
As am I, which is why it's not that big of a deal to take this stance.
> most major web players already 100% know who you are
I have no doubt about this, but there's also a whole internet full of others who I want to remain pseudonymous with. I've been using a handful of online identities for over 30 years now, and have never tied them to my real world identity.
The reasons for avoiding that hold more true now than ever before. Having to tie my online identities to my actual identity is unthinkable.
Then we require a phone number for everything (it's not easy to make unlimited new phone numbers) and use OIDC to authenticate to one of a couple providers. You won't be able to do anything on the internet without logging in first, but the login is safe at the identity provider.
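For reference, that flow is just the standard OIDC authorization-code flow with the 'phone' scope. A sketch with placeholder endpoints and names (a real integration would pull these from the provider's /.well-known/openid-configuration document):

    import secrets
    from urllib.parse import urlencode

    # Placeholder values; 'idp.example' stands in for whichever of the
    # couple of blessed identity providers you federate with.
    AUTH_ENDPOINT = "https://idp.example/authorize"
    CLIENT_ID = "my-site"
    REDIRECT_URI = "https://my-site.example/callback"

    def login_redirect_url(session):
        """Send the visitor to the identity provider. They come back
        to REDIRECT_URI with a one-time code that the server exchanges
        for an ID token carrying the verified phone_number claim."""
        session["oidc_state"] = secrets.token_urlsafe(16)  # CSRF guard
        return AUTH_ENDPOINT + "?" + urlencode({
            "response_type": "code",
            "client_id": CLIENT_ID,
            "redirect_uri": REDIRECT_URI,
            "scope": "openid phone",  # 'phone' requests the verified number
            "state": session["oidc_state"],
        })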
If you think about it this is no different than showing your ID to get into a bar.
I trust any bar more about privacy than anyone on the internet. Their incentive to maximize consent to stalk me and my behavior is close to non-existent.
Phone numbers are a cheap and reusable resource. Pushing the problem on another site with OIDC doesn't help either, if their CAPTCHAs have the same limitation.
"Real" cellular phone numbers are a very finite pool and require nontrivial amounts of money, a physical phone (with a burned-in hardware identifier), an in-person interaction, and government ID that validated against a state database.
"Real" cellular phone numbers are a very finite pool and require nontrivial amounts of money, a physical phone (with a burned-in hardware identifier), an in-person interaction, and government ID that validated against a state database.
You'd think so, but no.
I signed up for T-Mobile service early this year with no ID, and paid cash.
The store is so eager to complete the transaction that it keeps a government ID document in a drawer and the sales people whip it out whenever anyone looks queasy about providing information.
I didn't resist giving my information. All I did was pause because I wasn't sure if I brought my ID with me. Even that little hesitation was enough for the clerk to say, "Don't worry about it. I got you covered" and he pulled out the ID.
So I have a T-Mobile account that I can pay for with cash and no ID on file, and someone else's address.
Now, if a government was really interested in me, it could probably pull the security camera video or follow the signal around or whatever. But it turns out that KYC is easily bypassed when the incentives are right.
Some places might require all that, but a lot don't need an in-person visit or an ID card. And there are also SMS verification services that charge you a few cents per verification, and they use "real" numbers.
In the US, you can walk in a T-Mobile store and get a SIM card for cash, no questions asked.
Also, once you are sitting on a few dozen phone numbers, you can use them again and again to spam or abuse different services (possibly for sale). It's not like CAPTCHA solutions that you have to do every time.
> What happens if we do reach a standoff where the bots become good enough at mimicking human behavior that the only way to tell us apart is unacceptable and illegal behavioral analysis?
Sophisticated bots are already good enough at this that a variety of behavioral-based bot analysis tools exist and are in semi-widespread use. They're not illegal.
Thankfully, soon we will have the web integrity API to verify that a visitor is human.
Apple devices already support something like this when connecting to websites behind cloudflare and fastly, and as cloudflare explains this "vastly improves privacy by validating without fingerprinting"[1].
I tried to register for twa^Hitter yesterday and couldn't fill in all the captchas it was throwing at me. First it was some innocent "point the train to the station with the letter shown in the picture on the left", and after I barely completed that it was throwing 6 pictures of different string knots at me where I was supposed to point out which picture has two strings or maybe just one. Funny thing is that after the first 5 it throws 20 more, and after 40 or 50 filled in total I just gave up. No Twitter for me I guess, and that is probably for the best.
The best one is NPM. You have to pick two identical icons that are overlayed on other images. But if the username you picked is taken then the whole form resets, and you have to solve the captcha again. There is no way to check if a username is taken before solving the CAPTCHA, even though npm usernames are public.
I almost never get captchas right on the first try any longer. It seems obvious that AIs will get better than humans at these things. It’s just a question of model cost and timing.
They made the joke of having first typoed the name of the company as "Twatter", before fixing it up to "Twitter". Visualizing a backspace in text as ^H is an old joke rooted in the backspace control character being ASCII 0x08, which also maps to Ctrl-H.
The joke is intended to be funny because "twat" is a vulgar and generally derogatory term, and the author almost but not quite applied it to either a large company or (transitively) to its users.
Another cliched^Wcommon joke relates to the werase ("word erase") character in TTYs. As you'd guess, it kills the previous word, and is typically bound to ctrl-W.
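You can see the ^H effect in any terminal; the \b escape below is that same ASCII 0x08:

    # '\b' only moves the cursor back, so the space is what actually
    # blanks out the 'a' -- this prints 'twitter' in a terminal.
    print("twa\b \bitter")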
The article being reported on does not make this conclusion. The study authors are interested only in the time it takes for (human) users to complete CAPTCHAs, and did not examine the speed at which bots solve them.
The fact that bots can solve them -- and solve them fast -- is apparently a well-established fact in the literature. There is a table in the article comparing its (human) participants' solve times to a number of previous studies which examined how fast/accurate bots can be.
The Register (and the New Scientist, which most of this is cribbed from) is looking for a headline, so whatever. But the study's authors say that the "surprising" part is that "solving time and user perception are not always correlated" for human users. Game-based CAPTCHAs with sliders may take longer, but the users in the study still enjoyed them more than image-selection-based ones.
If you ask me, the whole idea of trying to prevent bad actors from acting badly by throwing up barriers to EVERYONE trying to get access to your system is... weird.
Better to deploy some light measures (tarpitting, RBLs etc.) on entry, then weed out the bad actors once they start acting bad inside the system, no? I mean CAPTCHA for everyone? Come on.
CAPTCHAs exist precisely because those were inadequate 15 years ago.
You may not have been around for it, but it's not like everyone was super duper excited to put these things on their web sites. It was something people were dragged into kicking and screaming, and even today a lot of those older measures are still deployed alongside CAPTCHAs.
You are probably underestimating the willingness of bad actors to put effort into avoiding these things. Is your model of a "bad actor" on the web some malicious guy writing a program and running it on his personal laptop from his home connection? Because in 2023, your threat model should be something more like a guy who rents a botnet with millions of computers of all sorts on it (renting one being only somewhat harder than renting AWS capacity; it's not that hard at all really). He collaborates with other bad actors to work out how best to bypass filtering, creates websites that do things like CAPTCHA proxying so that humans fill out the CAPTCHAs in return for free porn or something, and trades rootkits and other exploits around, both for home computers and for compromising web servers for their campaigns (for the URL cred). You're not up against some guy; you're up against a honed and tuned machine with years of experience, internal division of labor and skill sets, basically an entire parallel predator economy.
Tarpitting and RBLs are not dead, but they became just one layer a long time ago.
In my experience Captchas are used a lot by inexperienced developers. As you stated, they are not particularly hard to circumvent, but they are incredibly easy to implement.
So developers just install a Captcha and outsource the problem to Google.
I think the primary way to deal with the problem should be to design services in a way to make them unsuitable for spammers.
"For distorted text fields, humans took 9-15 seconds with an accuracy of just 50-84 percent. Bots, on the other hand, beat the tests in less than a second with 99.8 percent accuracy."
I'm guessing part of the answer (most likely already implemented in things like reCAPTCHA) is rate limiting and detecting bots when they solve these too quickly.
The bots would just slow down then. Their time is ~free.
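The timing check is a one-liner, and so is waiting it out (a sketch with an invented threshold and a stand-in solver):

    import time

    MIN_HUMAN_SOLVE_SECONDS = 3.0  # invented threshold

    def looks_human(issued_at, submitted_at):
        # Sub-second solves get flagged as bots...
        return (submitted_at - issued_at) >= MIN_HUMAN_SOLVE_SECONDS

    def solve_instantly(challenge):
        return "answer"  # stand-in for a 99.8%-accurate solver

    def bot_solve(challenge):
        # ...and the counter-move costs the bot nothing but idle time.
        answer = solve_instantly(challenge)
        time.sleep(MIN_HUMAN_SOLVE_SECONDS)
        return answer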
reCAPTCHA is one of the better captchas because they do a decent amount of browser fingerprinting and their captchas are interactive.
Still, there are services for solving them. Fun thing is you only need to pay those services for the first ~50K captchas, and then you can train your own solver using the data you collected.
Ultimately, captchas only serve to increase the cost of running bots. If whatever you're trying to protect is worth more than that cost, you will fail.
Modern captchas are mostly about letting Google or Cloudflare track you anyway, aren't they? When I am forced to actually click on stoplights (most of that stuff should be easy object detection) I usually just get stuck in an infinite loop. Actually mastering the cognitive task is not really important now.
You can tell because it's not actually Google or Cloudflare installing captchas on third-party websites. They in fact cannot do that. It's done by the people operating the websites, who desperately need to protect them against abuse, and for whom letting a company track you is not even a hypothetical motive.
Even in the case where you're getting some kind of behavior or reputation verdict from past behavior (and possibly across multiple surfaces), you probably want a progressive set of outcomes rather than just a binary allow or deny. Even if some requests are clearly best just blocked and others should obviously be allowed, there's always going to be a grey area where you're not sure. You need something to do with those requests. Making an arbitrary choice is one option, but pretty harsh on the legit users. A captcha is another.
Sometimes you have options for that gray area that are much better than captchas that you can do, e.g. request a phone number and do an SMS challenge. But that's both expensive and will lead to a massive dropoff for most sites as people won't be willing to give out their phone number to every site.
(Also, the act of solving the puzzle can give you additional signals of whether the request is from a bot or not. Signal collection is kind of the entire point of the slider captchas in the first place.)
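Concretely, the decision becomes a ladder instead of a gate (a sketch; the scores and cutoffs are made up):

    def outcome(risk_score):
        """Map a 0.0-1.0 bot-likelihood score to a progressively
        harsher response instead of a binary allow/deny."""
        if risk_score < 0.2:
            return "allow"
        if risk_score < 0.6:
            return "captcha"        # the grey zone
        if risk_score < 0.9:
            return "sms_challenge"  # expensive and high-friction
        return "block"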
I regard these as more or less slave labor and in the interest of polluting training data I've been intentionally making a minority of incorrect selections on these for years.
I was surprised how rarely I have to make more than one submission, in spite of intentionally making incorrect selections.
I'm looking forward to Google getting sued after a Waymo tries to make a right on red at a pontoon boat.
Interesting. The authors of the captcha solving bot papers claim 100% accuracy for reCaptcha and 98% accuracy for hCaptcha.
That Google does this is not really a surprise, since they earn money by letting bots through (bots are then counted as humans and Google can bill for ads shown to the bot), but hCaptcha at least advertises that its interest is actually detecting bots.
I personally know someone who has specifically designed bots to defeat both, and also the ones that defeat the "drag the puzzle piece" and similar "bot-defeating" technologies!
At this point I think your best bet is security through obscurity. With the state of generally available AI tools and processing power, is there any general format that can't be solved?
Surely the best option nowadays is to make your own or find an obscure one, and hope it's unusual enough that ready-made software doesn't exist that can easily solve it. Then, if and when it gets cracked to the degree that it's impacting your content, move on to another one.
Putting in a CC for a trial to access your service doesn't end up deterring anybody because you can just buy a massive CSV with stolen details for relatively cheap. Even if they're frozen/cancelled, the card numbers are still valid numbers. If you want to pay to authenticate each and every one of them, you'll probably run yourself dry.
Can also just use old empty VISA gift cards, ethically sourced from relatives and friends of course.
With that, besides the issues other commenters raised, you're also putting off a very significant portion of legitimate users who, for any number of reasons, may not want or be able to put a credit card into your site, even if they completely trust you.
> Google's implementation, reCAPTCHA, eventually did away with much of these shenanigans to make the browser identify low-risk human users in the background, but the image verification method still pops up occasionally if risk cannot be ascertained.
Can’t remember the last time I clicked reCAPTCHA and didn’t have to do the challenge. Come to think of it, I can’t remember a single time, so if it has happened it is very rare, whereas the Cloudflare one always lets me through.
Try using a browser which resists tracking more, you'll see it. And a lot of the time you see it it's not actually solvable: the captcha system has already decided you're a bot and will just try to tarpit you with ever slower-appearing challenges which will always fail.
The worst captchas are the ones on 4chan. I don't think I've ever gotten one right on my first try. It really discourages participation from all but the most dedicated of people. I swear it was added with the ultimate goal of reducing activity and getting regulars to pay for gold.
What about proof of work based CAPTCHA like https://github.com/mCaptcha/mCaptcha ? Since CAPTCHAs can be solved by bots, at least make it more costly for them.
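The core idea is hashcash-style: the server hands out a nonce, the client burns CPU finding a hash with enough leading zero bits, and verification costs a single hash. A minimal sketch of the concept (not mCaptcha's actual protocol):

    import hashlib
    import os

    DIFFICULTY = 20  # leading zero bits required; tune per threat level

    def issue_challenge():
        return os.urandom(16).hex()  # server-chosen nonce

    def leading_zero_bits(digest):
        return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()

    def solve(challenge):
        # Client side: brute force, ~2**DIFFICULTY hashes on average.
        counter = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
            if leading_zero_bits(digest) >= DIFFICULTY:
                return counter  # the proof of work
            counter += 1

    def verify(challenge, counter):
        # Server side: one hash.
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY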
Say, is there a Firefox extension that will solve captchas for me when I click on them? I don't need auto-solving as in never seeing the captcha; I just don't want to figure out which of these blurry photos has palm trees and which has hills.
Sometimes the value of the action a bot is trying to perform is simply so low that even a simple obstacle is effective. Like, how much compute do you want to spend to write one spam message through a contact form?
Has this not been the case for a very long time? Besides being a sneaky way for Google to train its models, I had the impression that captchas are more a way to increase friction than anything that would actually stop a determined actor.
They increase the processing/energy cost/set-up time/difficulty to the point where it may no longer be profitable to access that content, but no one really thought a powerful computer with the right software couldn't actually solve them at pace, right?
I’ve noticed that in order to defeat captcha sometimes I have to slow down and click the crosswalks one at a time, pausing between each, then count to 3 and click submit.