Ask HN: Why isn't there a 100% undetectable automatable browser? (off the shelf)
10 points by throwawayadvsec on July 21, 2023 | 16 comments
I've tried plenty of automatable browsers over the years (nightmare, puppeteer+puppeteer-stealth, playwright, fakebrowser, selenium, secret-agent...)

Unless you tweak them a lot, they're all detectable.

I'm not talking about headless browsers, because they're all detectable no matter what you do (at least with timing attacks and the way things are rendered... you need to run them in a real window to make them undetectable)

Why is there not one 100% undetectable browser out of the box?

Just a regular chrome/firefox with some way to fake clicks and inputs without changing any of the behaviors of the browser besides this?

I've had multiple ideas, like running a real browser with a fake uBlock/adblock extension that gets the coordinates/values of elements, transmits those to another process, and then moves a real mouse cursor across the screen and sends keyboard events to the actual OS. Or just use OCR on the screen without doing anything to the browser. (Of course, clicks and keyboard events need to emulate the timings and movements of a human, but that's another problem entirely.)
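For the input side, a minimal sketch of that idea, assuming robotjs for OS-level mouse and keyboard events (the coordinates are hardcoded stand-ins for whatever the extension or OCR step would find):

    // Drive the OS input devices instead of the browser. Assumes robotjs;
    // coordinates are placeholders for what the extension/OCR process finds.
    import robot from "robotjs";

    const rand = (min: number, max: number) => min + Math.random() * (max - min);
    const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

    function humanClick(x: number, y: number) {
      // moveMouseSmooth interpolates the cursor path instead of teleporting it
      robot.moveMouseSmooth(x + rand(-3, 3), y + rand(-3, 3));
      robot.mouseClick();
    }

    async function humanType(text: string) {
      for (const ch of text) {
        robot.typeString(ch);
        await sleep(rand(50, 200)); // randomized inter-keystroke gap
      }
    }

    humanClick(640, 412);     // e.g. a login button located by the extension
    await humanType("hello"); // top-level await assumes an ESM entry point

The browser never sees an automation protocol; from its point of view these are ordinary OS events.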

But it would seem really painful to make those reliable.

Do you know of a package/browser that is 100% undetectable and just exactly behaves like a real browser?



Selenium, Puppeteer, Playwright, Cypress, etc. can all drive "real" browsers, although note that the default config is altered for performance/testing reasons, and this can be detected in some cases.

But assuming you're driving a real browser you're probably being "detected" because of your behaviour.

Humans are slow. They follow certain browsing patterns. They interact with the site in a certain way (scrolling, moving the mouse cursor, keyboard presses).

If a client is hitting a site many times a minute without much scrolling or mouse movement, and seemingly doing things in an unusual or systematic way, it will often trigger security measures like captchas.

What you're describing here is the product of an arms race between people who want to scrape and exploit websites using automated tools and the websites themselves who want to offer the best service possible to legitimate users.

It's partly why so much of the internet today requires people to log in and verify phone numbers / email addresses. It's also why we see captchas and other tactics to slow users down and screen out bots.

If you built something completely undetectable the bad guys would love it and sites would then need to find new ways to detect / stop whatever you're doing.


“Of course, clicks and keyboard events need to emulate the timings and movements of a human, but that's another problem entirely.”

I don’t think that’s a different problem. As you know, real browsers are operated by humans, and humans have limited processing capacity and somewhat predictable navigation patterns.

To be undetectable, you have to mimic both (almost) perfectly.

So, you have to both limit the speed at which a fake user requests pages _and_ request them in a believable sequence.

The speed issue can be circumvented by using multiple machines to make requests from, but the second is the real problem. A collection of fake users would have to either stay under the radar by not making many requests or follow human patterns.

Combined, I would think it’s not possible to efficiently scrape lots of pages in a way that cannot be detected.


From my experience, it's a completely different problem.

For 99% of websites there is no serious behavioral analysis, and when there is, it's easily bypassable.

It's as simple as:

wait a random 500-3000 ms before clicking, and always land clicks slightly off-center of the element

wait a random 50-200 ms between keystrokes

use ghost-cursor to make realistic mouse movements along bezier curves
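For illustration, a sketch of those three tricks with Puppeteer and ghost-cursor (the URL and selector are placeholders; assumes an ESM entry point for top-level await):

    // Human-ish interaction: random pre-click delay, off-center clicks via
    // ghost-cursor's bezier-curve movements, randomized inter-keystroke gaps.
    import puppeteer from "puppeteer";
    import { createCursor } from "ghost-cursor";

    const rand = (min: number, max: number) => min + Math.random() * (max - min);
    const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://example.com/login");

    const cursor = createCursor(page); // moves the cursor along bezier curves

    await sleep(rand(500, 3000));    // random think time before the click
    await cursor.click("#username"); // ghost-cursor picks a point inside the
                                     // element rather than the exact center

    for (const ch of "alice") {      // type with 50-200 ms between keystrokes
      await page.keyboard.type(ch);
      await sleep(rand(50, 200));
    }

    await browser.close();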

Most of the time you get spotted because some obscure value in some random part of the browser or request headers isn't what it's supposed to be, not because your actions aren't realistic.
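For context, those checks run in page JavaScript on the site's side; a few of the well-known tells look roughly like this (not an exhaustive list):

    // A handful of classic automation giveaways:
    const signals = {
      webdriver: navigator.webdriver === true,            // set by WebDriver/CDP automation
      headlessUA: /HeadlessChrome/.test(navigator.userAgent),
      noPlugins: navigator.plugins.length === 0,          // headless builds often report none
      noLanguages: navigator.languages.length === 0,
    };
    const suspicious = Object.values(signals).some(Boolean);
    // a real site would feed `suspicious` into its captcha/blocking logic

Stealth plugins work by patching exactly these kinds of values, which is why a single unpatched one gives you away.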


Or, save yourself the hassle of pretending to be using a mouse, by pretending to be using a touchscreen.
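For instance, with Puppeteer's built-in device descriptors (a sketch; the device preset and URL are arbitrary):

    // Emulate a touch device so interactions arrive as touch events.
    import puppeteer, { KnownDevices } from "puppeteer";

    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.emulate(KnownDevices["iPhone X"]); // viewport, UA, touch support

    await page.goto("https://example.com");
    await page.tap("button"); // dispatches touch events, no mouse trail to fake

    await browser.close();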


well, it's also pretty hard to simulate a phone, arguably even harder than a desktop browser imo

for example I'm pretty sure the GPU/canvas fingerprint can't be faked
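That's because a canvas fingerprint hashes actual rendered pixels, which depend on the GPU, driver, and font stack. A minimal sketch of what fingerprinting scripts do:

    // Render text to an offscreen canvas and read the result back.
    const canvas = document.createElement("canvas");
    const ctx = canvas.getContext("2d")!;
    ctx.textBaseline = "top";
    ctx.font = "14px Arial";
    ctx.fillText("fingerprint me", 2, 2);
    const fp = canvas.toDataURL(); // pixel-exact output varies per GPU/driver/OS
    // the site hashes `fp` and checks it against what the claimed device should produce

Overriding a JavaScript property is easy; producing pixel output consistent with the hardware you claim to have is not.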


I've been thinking about this lately.

I would think that a fully automatable browser might be an accessibility need, and therefore a legitimate use whose free access may be protected under ADA regulation. Of course, you might need someone with a disability to push a legal threat in the direction of one of the offending companies in order to get them to play nicely.

I admit I don't like that approach. I would rather just find a solution that works (or perhaps build one and market it).


AppleScript + accessibility mode + Safari should get you pretty far, I'd think. AppleScript is a terrible language to work with, but if you want a real browser, you want a real browser.


interesting, but I should have probably mentioned that something scalable would be nice

Mac instances are 30 bucks a day on AWS


I thought you wanted to look like a person, so why are you running on AWS?

Pick up a used Mac somewhere for a week's worth of AWS rental and you can run it from home on residential internet.


You can use proxies from AWS, so that's not an issue.

I need to scale; using a physical machine is not feasible.


What issues are you running into with mentioned browsers? Captcha?

You likely need, in the order of importance:

- a signed-in google account

- mobile proxy

- captcha solver plugin

- randomized offsets and delays

- virtual screen, screenshot, and OCR tools for specific cases
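For illustration, a sketch of how the proxy, captcha solver, and stealth patches slot together with puppeteer-extra (the proxy host, credentials, and 2captcha token are placeholders):

    // puppeteer-extra with stealth + captcha-solver plugins, behind a proxy.
    import puppeteer from "puppeteer-extra";
    import StealthPlugin from "puppeteer-extra-plugin-stealth";
    import RecaptchaPlugin from "puppeteer-extra-plugin-recaptcha";

    puppeteer.use(StealthPlugin());
    puppeteer.use(
      RecaptchaPlugin({ provider: { id: "2captcha", token: "YOUR_API_KEY" } })
    );

    const browser = await puppeteer.launch({
      headless: false,
      args: ["--proxy-server=http://mobile-proxy.example.com:8000"],
    });
    const page = await browser.newPage();
    await page.authenticate({ username: "proxyuser", password: "proxypass" });

    await page.goto("https://example.com");
    await page.solveRecaptchas(); // added by the recaptcha plugin

The signed-in Google account and the randomized offsets/delays sit on top of this and aren't shown.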


I knew about all that besides the google account

do a lot of websites try to check if you have a google account?

The worst thing I came across was chaff bugs[0] that only happened in puppeteer.

I'm not really looking for a method to not be blocked; I'm looking for a browser controllable from code that is indistinguishable from a regular browser out of the box. There is a slight difference.

[0]: https://www.csoonline.com/article/566227/what-is-a-chaff-bug...


What is your use case for this? Is there a specific website that you're targeting?


I do a lot of scraping/automation for many different websites, so any hard-to-automate websites: websites protected with Cloudflare/DataDome, FAANGs, Microsoft websites, or websites with advanced custom protections...


It is more about the IP address range. There are apps that give users cash in exchange for using their home connectivity.

If you want 100% undetectable, use a real browser with image recognition and an auto clicker.
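For the image-recognition half, a sketch with tesseract.js over a saved screenshot (the path and target word are placeholders, and the exact output shape varies across tesseract.js major versions); the returned point would be handed to an OS-level auto clicker:

    // OCR a screenshot and return coordinates for a target word.
    import { createWorker } from "tesseract.js";

    async function findWordOnScreen(screenshotPath: string, target: string) {
      const worker = await createWorker("eng");
      const { data } = await worker.recognize(screenshotPath);
      await worker.terminate();

      // `data.words` follows the v4-style output; newer versions may require
      // enabling word-level boxes explicitly
      const word = data.words.find((w) => w.text === target);
      if (!word) return null;

      // center of the word's bounding box, in screenshot pixel coordinates
      return {
        x: (word.bbox.x0 + word.bbox.x1) / 2,
        y: (word.bbox.y0 + word.bbox.y1) / 2,
      };
    }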


IPs aren't enough in a lot of cases.

"If you want 100% undetectable, use a real browser with image recognition and an auto clicker." Do you know an open source package that does this reliably?




