There's also the post going around about how it can (and does) falsely flag human posts as AI output, particularly among some autistic people. About as useful as a polygraph, no?
Both kinds of misclassification are equally (un)useful: flagged "human" but actually "LLM", and flagged "LLM" but actually "human". As long as no one puts too much weight on the result, no harm is done in either case. But clearly, people can't stay away from jumping to conclusions based on what a simple-but-incorrect tool says.
A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process. There is no way to know when it’s wrong so you’ll either use it to help justify what you want, or ignore it.
> A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process.
It can be used for some decisions (i.e. non-critical ones), but it should NOT be used to accuse someone of academic misconduct unless the tool meets a very robust quality standard.
The AI tool doesn't give accurate results. You don't know when it's inaccurate. There is no reliable way to check its results. Why would anyone use a tool to help them make a decision when they don't know when it will be wrong and it has a low rate of accuracy? It's in the article.
Almost nothing gives 100% accurate results. Even CPUs have had bugs in their calculations. You have to use a suitable tool for a suitable job, with the correct context, while understanding its limitations so you can apply it correctly. Now that is proper engineering. You're partially correct, but you're overstating:
> A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process.
That's totally wrong and an overstated position.
A better position is that some tools have such a low accuracy rate that they shouldn't be used for their intended purpose. Now that position I agree with. I accept that CPUs may give incorrect results due to a cosmic ray event, but I wouldn't accept a CPU that gives the wrong result for 1 in 100 instructions.
That sounds like a less serious problem: if the tool highlights the allegedly plagiarized sections, at worst the author can conclusively prove the accusation false with no additional research (though that burden should instead be on the tool's user, of course). So it's at least possible to use the tool to get meaningful results.
On the other hand, an opaque LLM detector that just prints “that was from an LLM, methinks” (and not e.g. a prompt and a seed that makes ChatGPT print its input) essentially cannot be proven false by an author who hasn’t taken special precautions against being falsely accused, so the bar for sanctioning people based on its output must be much higher (infinitely so as far as I am concerned).
ChatGPT isn't the only AI. It is possible, and inevitable, to train other models specifically to avoid detection by tools designed to detect ChatGPT output.
The whole silly concept of an "AI detector" is a subset of an even sillier one: the notion that human creative output is somehow unique and inimitable.
You're right. After reading what I wrote, I agree there should be some reasonable expectations of a tool, such as how accurate it is and what the consequences are when it's wrong.
The AI detection tool fails on both counts: it has low accuracy and could ruin someone's reputation and livelihood. If a tool like this were merely helping you pick what color socks to wear, it would be about as good as asking a Magic 8-Ball whether you should wear the green ones.
This is a strawman. First, the AI detection algorithms can't offer anything close to 99.9% accuracy. Second, your scenario doesn't analyze another human and issue judgement, as the AI detection algorithms do.
When a human is miscategorized as a bot, they could find themselves in front of academic fraud boards, skipped over by recruiters, placed in the spam folder, etc.
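To put rough, purely illustrative numbers on that: everything below is assumed (essay counts, cheating rate, error rates), and is deliberately far more generous than any real detector I know of.

```python
# Rough illustration with hypothetical numbers: how many innocent students
# get flagged, even by a detector much better than current tools?
essays = 10_000             # essays screened in a semester (assumed)
cheat_rate = 0.05           # fraction actually written by an LLM (assumed)
false_positive_rate = 0.01  # detector wrongly flags 1% of human essays (assumed)
true_positive_rate = 0.80   # detector catches 80% of LLM essays (assumed)

human_essays = essays * (1 - cheat_rate)
llm_essays = essays * cheat_rate

false_accusations = human_essays * false_positive_rate  # innocent students flagged
true_detections = llm_essays * true_positive_rate       # actual LLM essays flagged

print(f"innocent students flagged: {false_accusations:.0f}")   # 95
print(f"LLM essays flagged:        {true_detections:.0f}")     # 400
print(f"share of flags that are false accusations: "
      f"{false_accusations / (false_accusations + true_detections):.0%}")  # ~19%
```

Even with those generous assumptions, roughly one flag in five is a false accusation; with real-world error rates the picture is worse.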
> Second, your scenario doesn't analyze another human and issue judgement, as the AI detection algorithms do.
> When a human is miscategorized as a bot, they could find themselves in front of academic fraud boards, skipped over by recruiters, placed in the spam folder, etc.
Is the problem here the algorithms or how people choose to use them?
There's a big difference between treating the results of an AI algorithm as infallible and treating them as just one piece of probabilistic evidence, to be combined with others to produce a probabilistic conclusion.
“AI detector says AI wrote student’s essay, therefore it must be true, so let’s fail/expel/etc them” vs “AI detector says AI wrote student’s essay, plus I have other independent reasons to suspect that, so I’m going to investigate the matter further”
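As a rough sketch of what the second approach could look like (every number below is made up for illustration; treat each signal as a likelihood ratio and update a prior, rather than treating any single flag as proof):

```python
# Sketch of "one piece of probabilistic evidence, combined with others".
# All numbers are hypothetical; this is intuition, not a real policy.
def posterior(prior, likelihood_ratios):
    """Bayesian update: prior odds times the product of likelihood ratios."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

prior = 0.05  # assumed base rate of LLM-written essays in this class

# Assumed likelihood ratios: P(evidence | LLM-written) / P(evidence | human-written)
detector_flag    = 4.0  # detector says "AI" -- weak evidence, given its error rate
style_mismatch   = 3.0  # essay reads nothing like the student's in-class writing
no_draft_history = 2.0  # no drafts or revision history where one is expected

print(posterior(prior, [detector_flag]))                                    # ~0.17
print(posterior(prior, [detector_flag, style_mismatch, no_draft_history]))  # ~0.56
```

Even the ~56% at the end would only justify investigating further, not sanctioning anyone.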
That's exactly why the stock analogy doesn't work. People don't buy algorithms, they buy products - such as detectors or predictors. You necessarily have to sell judgement alongside the algorithm. So debating the merits of an algorithm in a vacuum, when the issue being raised is the human harm caused by detector products, is the strawman.
> People don't buy algorithms, they buy products - such as detectors or predictors. You necessarily have to sell judgement alongside the algorithm.
Two people can buy the same product yet use it in very different ways: some educators take the output of anti-cheating software with a grain of salt, others treat it as infallible gospel.
Neither approach is determined by the product design in itself, but rather by the broader business context (sales, marketing, education, training, implementation), and even by factors entirely external to the vendor (differences in professional culture among educational institutions/systems).
It's not a strawman. There are many fundamentally unpredictable things where we can't make the benchmark be 100% accuracy.
To make it more concrete on work I am very familiar with: breast cancer screening. If you had a model that outperformed human radiologists at predicting whether there is pathology confirmed cancer within 1 year, but the accuracy was not 100%, would you want to use that model or not?
It's a strawman because they aren't comparable to AI detection tests. A screening coming back as possible cancer will lead to follow up tests to confirm, or rule out. An AI detection test coming back as positive can't be refuted or further tested with any level of accuracy. It's a completely unverifiable test with a low accuracy.
You are moving the goalposts here. The original claim I am responding to is
"A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process."
I agree that there are places where we shouldn't put AI and that checking whether something is an LLM or not is one of them. However I think the sentence above takes it way too far and breast cancer screening is a pretty clear example of somewhere we should accept AI even if it can sometimes make mistakes.
That seems like a restrictive binary. Are there not other entities which generate text? What if a gorilla uses ASL that is transcribed? ELIZA could generate text, after a fashion, as a precursor to LLMs. It seems like there are a number of automated processes that could take data and generate text, sort of, like weather reports, no?
So I think the only thing a mythical detector could determine would be LLM, or non-LLM, and let us take it from there. But detectors are bunk; I've had first-hand experience with that.
You could but is there any reason to believe these two noisy signals wouldn't result in more combined noise than signal?
Sure, it's theoretically possible to add two noisy signals that are uncorrelated and get noise reduction, but is it probable this would be such a case?
It all depends on the properties of the signal and the noise. In photography you can combine multiple noisy images to increase the signal to noise ratio. This works because the signal increases O(N) with the number of images but the noise only increases O(sqrt(N)). The result is that while both signal and noise are increasing, the signal is increasing faster.
I have no idea if this idea could be used for AI detection, but it is possible to combine 2 noisy signals and get better SNR.
If the noisy signals are not completely correlated then the signal would be enhanced; however in this case I imagine that there is likely to be a strong correlation between different tools which would mean adding additional sources may not be so useful.
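A toy simulation of both points, assuming simple additive Gaussian noise (which is nothing like a real detector, so treat this as intuition only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each "tool" reports the true score plus noise.
true_score = 1.0
noise_sigma = 2.0   # each individual tool is noisier than the signal
n_tools = 10
n_trials = 100_000

# Case 1: independent noise -- averaging helps, error shrinks ~ sigma/sqrt(N)
indep = true_score + noise_sigma * rng.standard_normal((n_trials, n_tools))
print("single tool error:    ", indep[:, 0].std())        # ~2.0
print("10-tool average error:", indep.mean(axis=1).std())  # ~0.63 (about 2/sqrt(10))

# Case 2: strongly correlated noise -- tools share the same blind spots
shared = noise_sigma * rng.standard_normal((n_trials, 1))  # common error component
own = 0.3 * rng.standard_normal((n_trials, n_tools))       # small individual error
corr = true_score + shared + own
print("10 correlated tools:  ", corr.mean(axis=1).std())   # ~2.0, barely improves
```

With shared blind spots, the average of ten tools is barely better than any one of them, which is my worry about stacking LLM detectors that were all trained on similar data.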
TBH, a properly-administered polygraph is probably more accurate than OpenAI's detector (of course, "properly administered" requires the subject to be cooperative and answer very simple yes or no questions, because a poly measures subconscious anxiety, not "truth")
I mean, it literally and factually measures several of your body's autonomic responses, all of which are provably correlated with stress. That's what a polygraph machine is. Saying it measures nothing is factually incorrect.
You can't detect "truth" from that, but you can often tell (i.e. with better accuracy than chance) whether or not a subject is able to give a confident, uncomplicated yes-or-no to a straightforward question in a situation where they don't have to be particularly nervous (which is why it's not very useful for interrogating a stressed criminal suspect, and should absolutely be inadmissible in court).
But everyone knows that it's not very reliable in almost every circumstance it's used. My point is that while it's only marginally better than chance, it's still better than chance, unlike OpenAI's detector, which is significantly worse than chance.
Right. The point is: it absolutely does NOT measure what it claims to measure, i.e. truthfulness.
You can detect indicators of stress... or hot weather... or stage-fright (admittedly a form of stress)... or too much caffeine... or an underlying (maybe undiagnosed) medical condition, etc. So it does not even necessarily measure "stress".
It's about as useful as the so-called "fruit machine" which they used to test for homosexuality[0], in that it is utterly useless while at the same time it can be quite ruinous for people. People have been fired over polygraph "fails", and while the results are not admissible in courts, people probably have been fingered for crimes after they failed polygraphs. Also, criminals have gone free after passing polygraphs[1].
>But everyone knows that it's not very reliable in almost every circumstance it's used.
You and I may know that. But a lot of people actually do not. That's why it's still used. Either because the people administering those tests think it's "good science", or because they know that, while it's all bullshit, the person they are testing might not know that and might break down and admit to things. Remember that fake polygraph on the show The Wire, which was just a copier they strapped to the suspect? If I remember correctly, that was based on true events.
A quick Google search shows that you can hire "polygraphers" to, for example, "test" whether your partner was unfaithful, making claims such as: "However, assuming that you have a good polygrapher with a fair amount of experience in working with betrayal trauma, you're going to get results that are at least 90% accurate or better."[2]
The US (and probably a lot of other) government(s) like their polygraphs very much, too[3].
> you can often tell (i.e. with better accuracy than chance) whether or not a subject is able to give a confident, uncomplicated yes-or-no to a straightforward question in a situation where they don't have to be particularly nervous
Uhmm, if somebody sat me down in a room, strapped all kinds of "science" to my body and then asked me questions, I'd be quite nervous regardless of whether I was being truthful or not. In fact, I'd be even more nervous knowing it's a polygraph and bullshit, because I couldn't know whether the person administering it knows that too.
If that somebody then asked me "Have you ever killed a prostitute?", or "Have you ever colluded with the enemy?", or "Have you ever cheated on your partner?", or "Have you ever stolen from your employer?", for example, my stress would certainly peak despite being able to confidently and truthfully answer "No!" to all of those questions. And I am sure the polygraph would "measure" my "stress".
[1] E.g. the Green River Killer Gary Ridgway passed a polygraph, so the police turned their resources to another suspect who had failed one. That was in 1984. Ridgway remained free until his arrest in 2001. He killed at least four more times after the investigation stopped focusing on him because of that "passed" polygraph.