The real Turing test is not out of date. Two important considerations:
- The interrogator faces two players, a bot and a human, and can ask questions of both. Both try to convince the interrogator that they are the human, and in the end the interrogator must say which one is which.
- The test is to be done by experts; both the human player and the interrogator are expected to play to win. The goal is not to have some chit-chat and guess afterwards; the goal is to actively try to find the bot, and to come in well prepared.
For now, there is absolutely no way GPT-4 can pass a real Turing test. Remember, both the human player and the interrogator are trained in bot detection, and they collaborate. The only thing that is not allowed is for the two humans to know each other beforehand, so that they cannot use shared secrets against the bot. But using commonly known anti-bot techniques together is fair game.
The thing described in the paper is a weak variant of the Turing test that only tests the ability of AI designers to trick unsuspecting humans.
I see this comment often and am confused why we wouldn't want to "move the goalposts".
I'd rather call it "passing a milestone" or simply "progress". More specifically, it just means that most criteria, especially the Turing test, are poorly defined.
> I see this comment often and am confused why we wouldn't want to "move the goalposts". I'd rather call it "passing a milestone" or simply "progress".
You're literally moving the "moving the goalposts" goalpost to make it more palatable.
> More specifically it just means most criteria, especially the Turing test, are poorly defined.
Nonsense. The Turing test was widely accepted for the specific reason that it contained a concrete method of testing for Artificial Intelligence. Go look for a better-defined goalpost for intelligence. You'll find pile after steaming pile of meaningless philosophical hair-splitting, each so completely divorced from reality as to make it useless as a real-world measuring stick.
Turing's test was and is the best measure of 'intelligence'. LLMs have passed the test. They are intelligent. Narrowly so, and utterly without agency, but they're clearly intelligent. There's no need to move the goalposts so that we can feel better about our place in the universe.
I actually agree that Turing's 1950 definition is pretty vague in places and there are a few different interpretations out there. As we discuss in the paper, it's also unclear what should constitute a pass in terms of statistical analysis.
No it really isn't. The Turing test is just not an adequate methodology to determine intelligent behavior, and never was. This was already known way before generative ML models emerged.
Just to pick my personal favorite, which is mentioned at the end of the article: this function right here can, technically, pass the Turing test:
    def generate_answer(question: str) -> str:
        # Meet every query with silence.
        return ""
How? Simple: a human can just choose not to respond to any question. So, if a program does exactly that, that is, meets every query with silence, how do you differentiate it from a human who does the same?
What does the Turing Test determine anyway? According to the paper, it is supposed to measure a machine's intelligence, or, more prosaically, to answer the question "Can machines think?"
But that isn't what the test measures.
It measures how well a machine can trick a human into believing it is a person. So instead of measuring how well the machine does, the test measures how well the human does. That is the greatest flaw of the Turing Test, and the little "answer with silence" thought experiment showcases exactly that flaw.
"Can a machine trick a human" and "Can a machine think" are 2 very different questions. Humans can, and have shown to, be tricked by ELIZA and even simpler chatterbots, engines that don't even use any kind of ML, just large bodies of prewritten text and a number of static rules.
> by % of people who believe they are talking to a person.
And what does that denote? Say I get two groups of people. One is tricked by ELIZA 80% of the time, the other is tricked by ELIZA 40% of the time. Does that show that ELIZA passed or failed? Neither. It shows that the outcome of the test depends as much on the ability of the interrogator as it does on the quality of the machine's responses, if not more.
Imagine a litmus test (a chemical test to roughly determine the acidity of a solution) where the result depends on who performs it as much as, or more than, it does on the quality of the litmus paper. No lab would use that test, for obvious reasons.
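To make that concrete, here is a toy illustration with invented numbers: the same bot, judged by two different groups of interrogators, gets two different "pass rates", so the number measures the judges as much as the machine.

    # Toy illustration with invented numbers: the same bot judged by two
    # different groups of interrogators yields two different "pass rates".
    def pass_rate(judgments: list[bool]) -> float:
        """Fraction of interrogators who believed they were talking to a human."""
        return sum(judgments) / len(judgments)

    group_a = [True] * 8 + [False] * 2   # credulous judges: fooled 80% of the time
    group_b = [True] * 4 + [False] * 6   # skeptical judges: fooled 40% of the time

    print(pass_rate(group_a))  # 0.8
    print(pass_rate(group_b))  # 0.4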
Okay, so if it's a Turing test where there is an actual conversation (I doubt that in actual implementations, like the one linked by OP, the bot or the human can send empty messages), then the Turing test is an adequate methodology?
That’s an absurd and overly strict interpretation of what Turing described. Stipulating cooperation between participants is precisely in the spirit of Turing’s original work.
> That’s an absurd and overly strict interpretation of what Turing described.
No, it isn't.
The Turing test doesn't evaluate the correctness of any answers, their sophistication, or even whether there is an answer. All it evaluates is the ability of the interrogator to distinguish between the computer and the human.
And therein lies the greatest flaw of the test: It doesn't test the ability of the computer, it tests the ability of the interrogator.
> In practice, the test's results can easily be dominated not by the computer's intelligence, but by the attitudes, skill, or naïveté of the questioner. Numerous experts in the field, including cognitive scientist Gary Marcus, insist that the Turing test only shows how easy it is to fool humans and is not an indication of machine intelligence.
And another quote:
> Chatterbot programs such as ELIZA have repeatedly fooled unsuspecting people into believing that they are communicating with human beings. In these cases, the "interrogators" are not even aware of the possibility that they are interacting with computers. To successfully appear human, there is no need for the machine to have any intelligence whatsoever and only a superficial resemblance to human behaviour is required.
So the "silence program" may be an extreme case, but it showcases exactly this. If the computer simply says nothing, then what can the human do to determine it's a computer who is silent behind the curtain? And the answer is: Nothing. He can only guess. And since a person can just as easily be silent as a computer can, he might even mistake the human performer for a computer.
Yes, it's an objectively wrong interpretation of Turing's Imitation Game outlined in his paper, "Computing Machinery and Intelligence", published in Mind in 1950 [0]. It's literally on the first page:
> Now suppose X is actually A, then A must answer. It is A's object in the game to try and cause C to make the wrong identification.
Here's the justification your paper uses to suggest that Turing meant to allow for the possibility of silence as a response:
> In one interpretation of Turing’s test the female is expected to tell the truth, but we are not far off that time when silence was preferred to the “jabbering” of women, because “speech was the monopoly of man” and that “sounds made by birds were part of a conversation at least as intelligible and intelligent as the confusion of tongues arising at a fashionable lady’s reception”.
Additionally, your cited paper there even admits this is a theoretical extension of The Imitation Game:
> In its standard form, Turing’s imitation game is described as an experiment that can be practicalized in two different ways (see Figure 1) (Shah, 2011):
> In both cases the machine must provide “satisfactory” and “sustained” answers to any questions put to it by the human interrogator (Turing, 1950: p.447). However, what about in the theoretical case when the machine takes the 5th amendment: “No person shall be held to answer”? Would we grant “fair play to the machines”?
To repeat in case you missed it when you clearly and definitely read your own citation: "In both cases the machine must provide “satisfactory” and “sustained” answers to any questions put to it by the human interrogator (Turing, 1950: p.447)."
I didn't miss anything. Your entire criticism so far hinges on my usage of silence as the answer.
Alright. I'll modify the function only slightly.
return "I don't want to talk about this."
Replace that with a list of different answers and `random.choice(answers)` if you like. Now you've got a machine that gives "satisfactory and sustained" answers, only it always says no.
In other words, the exact same situation as with complete silence, only now we've dotted the i's and crossed the t's.
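For concreteness, a minimal sketch of what that modified responder might look like (the particular canned deflections are just illustrative):

    import random

    # Canned deflections; any list of non-committal answers would do.
    answers = [
        "I don't want to talk about this.",
        "I'd rather not say.",
        "No comment.",
    ]

    def generate_answer(question: str) -> str:
        # Always deflect, regardless of what is asked.
        return random.choice(answers)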
And since the human is able to refuse to give any answers as well, it makes the entire test pointless, as again the interrogator cannot base his decision on anything but guesswork.
The point of the "silence-thought-experiment" isn't to satisfy Turings paper to the letter. The point is to showcase a flaw in the methodology it presents.
The “null response” isn’t a “satisfactory” answer as it doesn’t address the question. “Must answer” means the person under question must provide an answer to the question being asked. As I already said, your own citation proposes non-response as an extension of the Imitation Game, not a standard possible answer. Non-answers are not at all addressed by Turing in his work, because they are not a possible outcome of the specific test he outlined.
It’s a weak thought experiment, and one does not derive meaningful results from it, as it does not fit the original game’s intent (and no one other than you proposes that it does). There are many other and better criticisms of the Turing Test.
Besides, you blindly cited a paper you yourself didn’t even read after repeated declarations of your own correctness at the expense of everyone else; I cannot think of a clearer example of “bad faith engagement.”
> “Must answer” means the person under question must provide an answer to the question being asked.
Yes, but it doesn't say what the answer has to be, it doesn't say it has to be correct, it doesn't say it has to have to do with the question.
> As I already said, your own citation proposes non-response
And I have shown you why that doesn't matter in the slightest, because a very trivial modification to the methodology could achieve the exact same thing while following the original paper's requirements to the letter.
> It’s a weak thought experiment
Wrong. It's a perfect demonstration of one of the many reasons why AI research is all but ignoring the Turing Test; the fact that the test is more about the interrogator than it is about the machine.
> “bad faith engagement.”
I don't agree with your statements and have presented arguments why; that's not arguing in bad faith.
Your bad faith comes from how you disagreed with my statements; you did not do the necessary due diligence to demonstrate I should continue putting forward additional effort in both understanding your point and respecting your ideas.
For example, you reply "Wrong." to a subjective evaluation I've made. It literally cannot be wrong (though you can disagree), yet you declare it so with confidence! That's bad faith, and it means I will not engage further.
> you did not do the necessary due diligence to demonstrate
I did all the necessary due diligence. I was perfectly aware that the paper used a variation on the imitation game. I also read Turing's paper long before this discussion started (I think I was in high school when I first stumbled upon it).
That's how I knew that it is easy to come up with basically the same thought experiment, without even changing any of the games rules.
> For example, you reply "Wrong." to a subjective evaluation I've made.
Because in my subjective evaluation it isn't a weak thought experiment, so I am fully within my rights to disagree with your evaluation.
I always saw the Turing test as kind of comparable to the Bechdel test. Updating it misses the entire point: there's a super simple and easily applicable threshold that, at least at the time it was established, a layman could use to soundly demolish most any example they came across.
It was never supposed to be an end-all-be-all, it was supposed to be a quick and dirty way of eliminating possibilities.