Both kinds of false positives are about as useful as each other: flagged "human" but actually LLM, versus flagged "LLM" but actually human. As long as no one put too much weight on the result, no harm would have been done in either case. But clearly, people can't stop themselves from jumping to conclusions based on what a simple-but-incorrect tool says.
A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process. There is no way to know when it’s wrong, so you’ll either use it to justify what you already wanted to do, or ignore it.
> A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process.
It can be used for some decisions (i.e. non-critical ones), but it should NOT be used to accuse someone of academic misconduct unless the tool meets a very robust quality standard.
The AI tool doesn't give accurate results. You don't know when it's inaccurate. There is no reliable way to check its results. Why would anyone use a tool to help them make a decision when they don't know when it will be wrong and it has a low rate of accuracy? It's in the article.
Almost nothing gives 100% accurate results. Even CPUs have had bugs in their calculations. You have to use a suitable tool for a suitable job, in the correct context, while understanding its limitations, in order to apply it correctly. That is proper engineering. You're partially correct, but you're overstating it:
> A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process.
That's totally wrong and an overstated position.
A better position is that some tools have such a low accuracy rate that they shouldn't be used for their intended purpose. Now that position I agree with. I accept that CPUs may give incorrect results due to a cosmic ray event, but I wouldn't accept a CPU that gives the wrong result for 1 in 100 instructions.
That sounds like a less serious problem: if the tool highlights the allegedly plagiarized sections, at worst the author can conclusively prove the accusation false with no additional research (though that burden should instead be on the tool’s user, of course). So it’s at least possible to use the tool to get meaningful results.
On the other hand, an opaque LLM detector that just prints “that was from an LLM, methinks” (and not e.g. a prompt and a seed that makes ChatGPT print its input) essentially cannot be proven false by an author who hasn’t taken special precautions against being falsely accused, so the bar for sanctioning people based on its output must be much higher (infinitely so as far as I am concerned).
ChatGPT isn't the only AI. It is possible, and inevitable, to train other models specifically to avoid detection by tools designed to detect ChatGPT output.
The whole silly concept of an "AI detector" is a subset of an even sillier one: the notion that human creative output is somehow unique and inimitable.
You're right. After re-reading what I wrote: there should be some reasonable expectations of a tool, such as how accurate it is, or what the consequences of a wrong result are.
The AI detection tool fails on both counts: it has low accuracy and it could ruin someone's reputation and livelihood. Even if a tool like this were only helping you pick which color socks to wear, it would be no better than asking a magic 8-ball whether you should wear the green ones.
This is a strawman. First, AI detection algorithms can't offer anything close to 99.9% accuracy. Second, your scenario doesn't analyze another human and issue judgement, as the AI detection algorithms do.
When a human is miscategorized as a bot, they could find themselves in front of academic fraud boards, skipped over by recruiters, placed in the spam folder, etc.
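To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Every figure in it is assumed purely for illustration (the submission count, the 2% false positive rate, the 90% true positive rate); real detectors don't publish reliable rates. The point is only that when the vast majority of submissions are honestly written, even a mostly-correct detector produces a steady stream of falsely accused humans.

    # Hypothetical numbers: how many honest writers get flagged per term?
    submissions = 10_000        # essays checked in a term (assumed)
    llm_share = 0.05            # fraction actually LLM-written (assumed)
    false_positive_rate = 0.02  # honest essays wrongly flagged (assumed)
    true_positive_rate = 0.90   # LLM essays correctly flagged (assumed)

    honest = submissions * (1 - llm_share)                 # 9,500 honest essays
    falsely_accused = honest * false_positive_rate         # humans flagged as bots
    caught = submissions * llm_share * true_positive_rate  # genuine LLM cases caught

    print(falsely_accused, caught)  # -> 190 falsely accused vs 450 caught

Under those assumed rates, that's roughly one innocent person hauled in front of a fraud board for every two or three genuine cases, and the ratio only gets worse as the honest share grows.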
> Second, your scenario doesn't analyze another human and issue judgement, as the AI detection algorithms do.
> When a human is miscategorized as a bot, they could find themselves in front of academic fraud boards, skipped over by recruiters, placed in the spam folder, etc.
Is the problem here the algorithms or how people choose to use them?
There’s a big difference between treating the result of an AI algorithm as infallible, and treating it as just one piece of probabilistic evidence, to be combined with others to produce a probabilistic conclusion.
“AI detector says AI wrote student’s essay, therefore it must be true, so let’s fail/expel/etc them” vs “AI detector says AI wrote student’s essay, plus I have other independent reasons to suspect that, so I’m going to investigate the matter further”
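To make the "one piece of probabilistic evidence" framing concrete, here is a minimal Bayes-rule sketch in Python. The detector rates and the posterior_llm helper are made up for illustration, not taken from any real product: the same flag is weak evidence when there is no other reason to suspect the student, and much stronger once combined with independent suspicion.

    # A minimal sketch of treating a detector flag as evidence, not a verdict.
    def posterior_llm(prior, true_positive_rate, false_positive_rate):
        """P(essay is LLM-written | detector flags it), via Bayes' rule."""
        p_flag = true_positive_rate * prior + false_positive_rate * (1 - prior)
        return true_positive_rate * prior / p_flag

    # Assumed rates: catches 90% of LLM text, wrongly flags 5% of human text.
    tpr, fpr = 0.90, 0.05

    # No independent reason to suspect the student: the flag alone is weak.
    print(posterior_llm(0.02, tpr, fpr))  # ~0.27

    # Independent reasons to suspect (style shift, missing drafts, etc.).
    print(posterior_llm(0.50, tpr, fpr))  # ~0.95

In the first case a flag still leaves the odds well against LLM authorship; in the second it pushes an already-serious suspicion toward near certainty, which is exactly the "investigate further" posture rather than the "fail/expel them" one.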
That's exactly why the stock analogy doesn't work. People don't buy algorithms, they buy products - such as detectors or predictors. You necessarily have to sell judgement alongside the algorithm. So debating the merits of an algorithm in a vacuum, when the issue being raised is the human harm caused by detector products, is the strawman.
> People don't buy algorithms, they buy products - such as detectors or predictors. You necessarily have to sell judgement alongside the algorithm.
Two people can buy the same product yet use it in very different ways: some educators take the output of anti-cheating software with a grain of salt, others treat it as infallible gospel.
Neither approach is determined by the product design in itself, but rather by the broader business context (sales, marketing, education, training, implementation), and even by factors entirely external to the vendor (differences in professional culture among educational institutions/systems).
It's not a strawman. There are many fundamentally unpredictable things for which the benchmark can't be 100% accuracy.
To make it more concrete with work I am very familiar with: breast cancer screening. If you had a model that outperformed human radiologists at predicting whether there is pathology-confirmed cancer within 1 year, but its accuracy was not 100%, would you want to use that model or not?
It's a strawman because screenings aren't comparable to AI detection tests. A screening that comes back as possible cancer leads to follow-up tests that confirm or rule it out. An AI detection test that comes back positive can't be refuted or further tested with any level of accuracy. It's a completely unverifiable test with low accuracy.
You are moving the goalposts here. The original claim I am responding to is:
"A tool that gives incorrect and inconsistent results shouldn’t have any part of a decision making process."
I agree that there are places where we shouldn't put AI, and that checking whether something came from an LLM is one of them. However, I think the sentence above takes it way too far, and breast cancer screening is a pretty clear example of somewhere we should accept AI even if it can sometimes make mistakes.
That seems like a restrictive binary. Are there not other entities that generate text? What if a gorilla uses ASL that is then transcribed? ELIZA could generate text, after a fashion, as a precursor to LLMs. It seems like there are a number of automated processes that could take data and generate text of a sort, like weather reports, no?
So I think the only thing a mythical detector could determine would be "LLM" or "non-LLM", and leave us to take it from there. But detectors are bunk; I've had first-hand experience with that.