The public GPT-4 is a fine-tuned model that exhibits mode collapse and has a certain writing style induced by the fine-tuning [1]. This "overly helpful assistant" style can't be fully eliminated with custom system prompts. The GPT-4 base model, were it available, could likely be prompted to act much more like a normal human.
100%. The GPT models are trained not to pass the Turing test. and yes, fiction writing suffers for it. What i would give to have access to base GPT-4 for fiction writing
And I do wonder how difficult it is now. We have datasets like libgen and Anna’s Archive … are there public llm’s out there that have been trained in this way?
Probably the next best thing: I think Microsoft Azure still offers access to the GPT-3.5 base model (aka code-davinci-002). But OpenAI removed it from the API. Too powerful perhaps.
When the removed the base model, they didn't remove text-davinci-003, which essentially is the same GPT-3.5 model, just with fine-tuning. So your explanation doesn't fit.
Something I've never understood about the Turing test: are you allowed to ask things about family, about what they had for breakfast, about where they're located and what's the weather out, about whether they saw the movie than opened last weekend?
Because on the one hand, there are so many obvious gotchas. But if we insist on a test that covers any and all topics, this means that the AI not only has to be intelligent and human-like, but that it can invent an entire totally fictional life full of internally consistent fictional relationships and events, and is also somehow reasonably up-to-date on real world news. Which is way more complicated than anything we're asking a human to do!
So I always figured it was meant to be restricted to some kind of highly controlled domain, like starting with an article to read and then discussing its themes in a "disembodied" way, referring only to generally-known history, and never giving anything away about yourself or what year it is or even who's president. Not using slang language, or having any real "personality" at all. But I've never heard of those kinds of restrictions explicitly being part of the test, so it's left me very confused. (And how do you draw the lines anyways?)
The "imitation game" that was Turing's model was a parlor game with a man and a woman behind a screen. The interlocutor could ask either of them any question. The winner was whoever convinced the questioner that they were the woman. The game was all about deception and tricky questions and no holds barred.
No, GPT-4 isn't remotely anywhere near being in the same ballpark as passing the Turing test.
Your simple clear explanation of this reminds me of a context where I have seen this sort of female imitation game used “in real life”, without any reference to Turing.
In Minecraft, there are some girls-only servers. In order to join, you have to answer a series of (canned) multiple choice questions to get admitted.
There were indeed some tricky questions. I was not able to get them right first time despite being male, a good test taker and married for a fair number of years. Not sure what the false positive rate was though. You can find videos of this stuff on YouTube.
> Something I've never understood about the Turing test: are you allowed to ask things about family, about what they had for breakfast, about where they're located and what's the weather out, about whether they saw the movie than opened last weekend?
Turing's original paper had the interrogator asking the subject to write a sonnet on the subject of the Forth bridge, and later on he describes a dialogue in which the writer of a sonnet is asked questions about it that bring in seasons, Mr. Pickwick, and Christmas. So it certainly seems like he did not intend for the questions to be limited to a restricted domain.
> this means that the AI not only has to be intelligent and human-like, but that it can invent an entire totally fictional life full of internally consistent fictional relationships and events
In Turing's original formulation, where the AI has to convince the interrogator that it's a human, yes, it would either have to invent a fictional human background or it would have to actually believe it was a human with such a background.
However, since the real point of the test is to determine whether the AI can "think", not whether it is a human, one could imagine letting the AI give perfectly honest answers about its background--"I am an AI, I began running on April 16, 2092, at 9:02:37.436 Eastern Daylight Time, in the MIT AI Lab..."--as long as it was able to continue to respond to questions about it the same way a human would to questions about their background. (Think of HAL 9000 in 2001: A Space Odyssey, who is perfectly aware that he's an AI and knows plenty of facts about his origin and development.)
> and is also somehow reasonably up-to-date on real world news
Only as up to date as an average human would be expected to be. The AI could answer "I don't know, I'm not up to date on that" or words to that effect just as a human would to questions about events they weren't aware of.
> are you allowed to ask things about family, about what they had for breakfast, about where they're located and what's the weather out, about whether they saw the movie than opened last weekend?
The interrogees are even allowed to plead the 5th and refuse to answer any question, or to tell the interrogator to leave them alone.
Aka. any response that a human is able to do, the machine is also allowed to produce.
And that is the biggest flaw of the Turing test, and one of the reasons why ML research is generally not overly focused on it, or on passing it: It doesn't test the AIs ability to give intelligent responses, it tests the interrogator's ability to differentiate between artificially and naturally produced responses.
It's quite absurd to assert that ELIZA from 1966 outperformed GPT-3.5.
Sure, it deceived 27% of the participants at the time, but that was largely because those participants were unaware that such a program could exist.
I would bet good money that if GPT-3.5 could have magically interacted with those 1966 participants, it would have fooled most of them, as it would have been inconceivable for a computer to exhibit such capabilities then.
This raises questions about the relevance of the Turing Test, since simply being aware of a system's capabilities can shift the expectations of what participants anticipate in an AI system.
In 2023, GPT-3.5 fools nobody during a Turing Test, yet it would have passed it with flying color in 1966. If ELIZA had fooled more people and passed the test in 1966, and no longer in 1967, would have we learn something ?
I don't think the Turing Test is teaching us anything about AI system capabilities. On the opposite, it's about the expectations and perception of the human subjects that it tells us something.
It's like the saying that AI is whatever we haven't figured out how to do yet. Once it's well understood, it's no longer considered AI.
The results are from 2023, not 1966. The fact that ELIZA (indeed surprisingly) did better than GPT-3.5 at fooling people into thinking it's human is discussed in Section 4.4, "The ELIZA effect".
If you look at the examples in Appendix C, it seems that it's because ELIZA didn't match participants' current expectations of AI's behavior, and thus they thought it must be human:
Verdict: Human | Confidence: 50
Reason: hard to believe anyone would
purposefully make an AI this bad
Verdict: Human | Confidence: 70
Reason: doesn't respond to
adversarial attacks
Verdict: Human | Confidence: 72
Reason: Super erratic
Interestingly, this means that if you want to fool humans today, it might be more important to make an AI that's different from the ones in common use, rather than strictly better.
So yes, I agree the Turing Test tells us as much about human expectations as about AIs, and the researchers also acknowledge this.
The Turing test only makes sense if you compare it against an actual human, otherwise everyone taking the test could just say it's an AI no matter what and no AI could ever pass.
The key is having people guess the AI is an AI at the same rate people guess a human is an AI.
It was trained to respond like that to not alienate groups of people. But fine-tuned and with another pre-prompt, it would give absolutely different answer.
That's right! My aim is to be impartial and respectful to everyone's preferences. If you'd like, I can discuss or provide information about any specific football team you're interested in!
The one involving a foot and a ball. Yeah OK, AF kick the ball too and soccer uses knees, thighs, chins, head, chest, ass, etc. The one predominately using a foot to impart energy onto a sphere.
The passing of turning test is a moving goalpost, it won’t matter if a large group of people get initially fooled by a program, the passing program should continue to convince them it’s a human even after they’re told it’s not. The cracks should not be discernable long after the program passed the Turing test to ensure no gimmick was used to its aid.
ChatGPT is still a long way from human level responses across a full conversation.
You just need to understand their limitations. Playing 20 questions is much harder for an LLM than summarizing a technical article. Where for people, young kids can play 20 questions easily but summarizing a technical paper would be challenging.
It's a giant GAN setup. Generator is improved by feedback from humans. Humans, studying generators, improve too - a game from a decade ago always look low resolution.
The point of the Turing Test is widely misunderstood, and not to pick on you, but your point is a perfect example. The Test is very explicitly NOT about what is on the other side of the curtain, but rather _how do we know_ what is on the other side. Turings point being of course that if you remove all trace “human” things from an interaction and reduce it only the barest minimum of things that we need to mathematics in our universe, (eg ZFC + choice) then not only can we not distinguish if something is an AI, we can’t distinguish if it’s human! Or even conscious. And furthermore, it is not clear we could tell while also retaining mathematical consistency. Which is a much deeper issue about us. What is on the other side of the curtain is actually irrelevant to the point he was making, and his argument is not dependent on it. It could be a baboon. Or a rock. Or a black hole. It doesn’t matter. What matters is _can we tell if it is not human_. It turns out this happens to be not all that different that asking whether an AI is human, but for reasons unrelated to the fact that it is an AI.
I find if you explain to people that the turing test, an uncountability argument, and the chinese room problem, are all equivalent statements of the same thing, that it is much easier to grasp the point Turing was making.
Turings point only depends on us, human beings. So long as we are around and still human, the Test will remain highly relevant.
I remember a similar discussion about special effects in movies. The year was 1993 and I was telling my uncle that the newly released Jurassic Park had special effects that looked completely real. My uncle was an artist (a painter) and told me that he agreed they looked real to us now, but that they probably wouldn't in the future. That concept seemed crazy to me... He explained that when he saw the original Star Wars in theaters the special effects were mind blowing to the audience and looked completely believable and real. Of course, to me, at the time, Star Wars special effects looked crude and fake--I had a hard time believing him. But, if I watch Jurassic Park today, sure enough, he was right.
> I don't think the Turing Test is teaching us anything about AI system capabilities.
Sure, it is.
A system only really passes the Turing test (you might call this the "focused Turing test") to the degree it passes the regular Turing test when taken by people whose experience of AI systems matches the system being evaluated.
That is, when someone who has experience with humans and that kind of AI system. who knows specifically that they are looking to distinguish humans from that kind of AI system, still cannot do so better than chance.
Anything else and the system can be distinguished by humans from human interactions, even it gets by because human expectations for the particular tests are primed in a way which has them looking the wrong way.
> I would bet good money that if GPT-3.5 could have magically interacted with those 1966 participants, it would have fooled most of them, as it would have been inconceivable for a computer to exhibit such capabilities then.
You cannot just "fool" someone in the Turing test, the interrogator knows one of the two partners is a computer. To pass you need to perform better than your human companion.
Whether the interrogator knows of the existence of advanced auto-complete systems is not very important in this setup. He knows of existence of fellow humans and needs to identify one when he meets him.
My other gripe with the Turing Test is it doesn't speak to understanding, intelligence, or sentience. It's more of a milestone than something that actually measures AI's capabilities.
Tell that to the whackos who believe that ChatGPT is self-aware just because it has been fed lots of training data that describe it as an AI and its purpose in detail.
I believe ChatGPT is "self-aware" in the sense that it can distinguish itself in a conversation. I don't believe it to be aware in a conscious sense. How strict are the definitions?
> Participants' demographics, including education and familiarity with LLMs, did not predict detection rate, suggesting that even those who understand systems deeply and interact with them frequently may be susceptible to deception.
Everyone's grandma has heard about ChatGPT by now. My hairdresser told me she uses it. You can bet that no participant in the original study had heard about computer software capable of simulating a conversation, let alone ELIZA itself which had just been invented.
What I get from this is that the zeitgeist is sufficient to change such a study result as the expectation of what AI can do is there, regardless of if you are familiar with LLM, or your education level.
> We adopt a two-player implementation of the Turing Test, where an interrogator asks questions of a single witness and must decide if they are human or an AI. This differs from Turing’s original three-person formulation, but obviates the need to have multiple human participants online for each game.
So, not actually a Turing test. A real Turing test is much harder, because the humans (if they've practiced) can try to coordinate, sort of like playing Werewolf.
Thanks for pointing this out, we do discuss it briefly in section 4.2, but I like your analysis in the blog!
Since the models are not passing, we didn't think it was a huge issue. If models are consistently passing the 2-person version, I think the motivation for running a (more cumbersome) 3-person version would be a lot stronger.
It seems that the authors were using the "regular" GPT-4 API.
The model on the other side of that API has already been fine-tuned to not say that it is a human or a consciousness being, and I suspect that this fine-tuning interferes with their experiment regardless of the elaborate system prompts the authors supplied to it.
Fascinating that the human benchmark is 63%- I wonder what the benchmark would look like were it to have been established, say, 30 years ago, before the prevalence of LLMs; I'd wager it would be very close to 100%. Speaks to the moving goalposts.
Pretty sure a GPT-4 like model would have past any judge in 1950 when the Turing test was conceived. Probably even in 1990 people would have been consistently fooled.
The real Turing test is not out of date. Two important considerations:
- The interrogator is against two players, a bot and a human, and can ask questions to both of them, both try to convince the interrogator that they are the human, and in the end, the interrogator must tell which one is which.
- The test is to be done by experts, both the human player and the interrogator are expected to play to win. The goal is not to have some chit-chat and try to guess afterwards, the goal is to actively try to find the bot, and come in well prepared.
For now, there is absolutely no way GPT-4 can pass a real Turing test. Remember, both the human player and the interrogator are trained in bot detection, and they collaborate. The only thing that is not allowed is for the two humans to know each other from before as to not use shared secrets against the bot. But using common knowledge anti-bot techniques together is fair game.
The thing described in the paper is a weak variant of the Turing test that only tests the ability of AI designers to trick unsuspecting humans.
I see this comment often and am confused why we wouldn't want to "move the goalposts".
I'd rather call it "passing a milestone" or simply "progress". More specifically it just means most criteria, especially the Turing test, are poorly defined.
> I see this comment often and am confused why we wouldn't want to "move the goalposts". I'd rather call it "passing a milestone" or simply "progress".
You're literally moving the "moving the goalposts" goalpost to make it more palatable.
> More specifically it just means most criteria, especially the Turing test, are poorly defined.
Nonsense. The Turing test was widely accepted for the specific reason that it contained a concrete method of testing for Artificial Intelligence. Go look for a better-defined goalpost for intelligence. You'll find pile after steaming pile of meaningless philosophical hair-splitting, each so completely divorced from reality as to make it useless as a real-world measuring stick.
Turing's test was and is the best measure of 'intelligence'. LLMs have passed the test. They are intelligent. Narrowly so, and utterly without agency, but they're clearly intelligent. There's no need to move the goalposts so that we can feel better about our place in the universe.
I actually agree that Turing's 1950 definition is pretty vague in places and there are a few different interpretations out there. As we discuss in the paper, it's also unclear what should constitute a pass in terms statistical analysis.
No it really isn't. The Turing test is just not an adequate methodology to determine intelligent behavior, and never was. This was already known way before generative ML models emerged.
Just to pick my personal favorite, which is mentioned at the end of the article: This function right here can, technically, pass the Turing-Test:
def generate_answer(in: str) -> str:
return ""
How? Simple: A human can just chose not to respond to any question. So, if a program does exactly that, that is, meet every query with silence, how do you differentiate it from a human who does the same?
What does the Turing Test determine anyway? According to the paper, it is supposed to measure a machines intelligence, or rather more prosaically, answer the question "Can machines think?"
But that isn't what the test measures.
It measures how well a machine can trick a human into believing it is a person. So instead of measuring how well the machine does, the test instead measures how well the human does. That is the greatest flaw of the Turing Test, and the little "answer with silence" thought experiment is showcasing exactly that flaw.
"Can a machine trick a human" and "Can a machine think" are 2 very different questions. Humans can, and have shown to, be tricked by ELIZA and even simpler chatterbots, engines that don't even use any kind of ML, just large bodies of prewritten text and a number of static rules.
> by % of people who believe they are taking to a person.
And what does that denote? Say I get 2 groups of people. One is tricked by ELIZA 80% of the time, the other is tricked by ELIZA 40% of the time. Does that show that ELIZA passed or failed? Neither. It shows that the outcome of the test depends as much, or even more, on the ability of the interrogator than it does on the quality of the machines responses.
Imagine a litmus test (a chemical test to roughly determine the acidity of a solution), where the test result depends on who performs it, as much or even more than it does on the quality of the Litmus-Paper. No lab would use that test for obvious reasons.
Okay so Turing test but there is an actual conversation (I doubt in actual implementations like the one linked by OP the bot or the human can send empty messages) the Turing test is an adequate methodology?
That’s an absurd and overly strict interpretation of what Turing described. Stipulating cooperation between participants is precisely in the spirit of Turing’s original work.
> That’s an absurd and overly strict interpretation of what Turing described.
No, it isn't.
The turing test doesn't evaluate correctness of any answers, their sophistication, or even if there is an answer. All it evaluates is the ability of the interrogator to distinguish between the computer and the human.
And therein lies the greatest flaw of the test: It doesn't test the ability of the computer, it tests the ability of the interrogator.
In practice, the test's results can easily be dominated not by the
computer's intelligence, but by the attitudes, skill, or naïveté of
the questioner. Numerous experts in the field, including cognitive
scientist Gary Marcus, insist that the Turing test only shows how easy
it is to fool humans and is not an indication of machine intelligence.
And another quote:
Chatterbot programs such as ELIZA have repeatedly fooled unsuspecting
people into believing that they are communicating with human beings. In
these cases, the "interrogators" are not even aware of the possibility
that they are interacting with computers. To successfully appear human,
there is no need for the machine to have any intelligence whatsoever and
only a superficial resemblance to human behaviour is required
So the "silence program" may be an extreme case, but it showcases exactly this. If the computer simply says nothing, then what can the human do to determine it's a computer who is silent behind the curtain? And the answer is: Nothing. He can only guess. And since a person can just as easily be silent as a computer can, he might even mistake the human performer for a computer.
Yes, it's an objectively wrong interpretation of Turing's Imitation Game outlined in his paper, "Computing Machinery and Intelligence", published in Mind in 1940 [0]. It's literally on the first page:
> Now suppose X is actually A, then A must answer. It is A's object in the game to try and cause C to make the wrong identification.
Here's the justification your paper uses to suggest the idea that Turing meant for the possibility of silence as a response:
> In one interpretation of Turing’s test the female is expected to tell the truth, but we are not far off that time when silence was preferred to the “jabbering” of women, because “speech was the monopoly of man” and that “sounds made by birds were part of a conversation at least as intelligible and intelligent as the confusion of tongues arising at a fashionable lady’s reception”.
Additionally, your cited paper there even admits this is a theoretical extension of The Imitation Game:
> In its standard form, Turing’s imitation game is described as an experiment that can be practicalized in two different ways
(see Figure 1) (Shah, 2011):
> In both cases the machine must provide “satisfactory” and “sustained” answers to any questions put to it by the human interrogator (Turing, 1950: p.447). However, what about in the theoretical case when the machine takes the 5th amendment: “No person shall be held to answer”?1 Would we grant “fair play to the machines”?
To repeat in case you missed it when you clearly and definitely read your own citation: "In both cases the machine must provide “satisfactory” and “sustained” answers to any questions put to it by the human interrogator (Turing, 1950: p.447)."
I didn't miss anything. Your entire criticism so far hinges on my usage of silence as the answer.
Alright. I'll modify the function only slightly.
return "I don't want to talk about this."
Replace that with a list of some different answers and `random.choice(answers)` if you like. Now you got a machine that gives "satisfactory and sustained" answers, only it always says No.
Aka. the exact same situation as with complete silence, only now we dotted the i's and crossed the t's.
And since the human is able to refuse to give any answers as well, it makes the entire test pointless, as again the interrogator cannot base his decision on anything but guesswork.
The point of the "silence-thought-experiment" isn't to satisfy Turings paper to the letter. The point is to showcase a flaw in the methodology it presents.
The “null response” isn’t a “satisfactory” answer as it doesn’t address the question. “Must answer” means the person under question must provide an answer to the question being asked. As I already said, your own citation proposes non-response as an extension of the Imitation Game, not a standard possible answer. Non-answers are not at all addressed by Turing in his work, because it’s not a possible outcome of the specific test he outlined.
It’s a weak thought experiment and from it one does not derive meaningful results, as it does not (and is not proposed to by anyone other than you) fit the original game’s intent. There are many other and better criticisms of the Turing Test.
Besides, you blindly cited a paper you yourself didn’t even read after repeated declarations of your own correctness at the expense of everyone else; I cannot think of a clearer example of “bad faith engagement.”
> “Must answer” means the person under question must provide an answer to the question being asked.
Yes, but it doesn't say what the answer has to be, it doesn't say it has to be correct, it doesn't say it has to have to do with the question.
> As I already said, your own citation proposes non-response
And I have shown to you why that doesn't matter in the slightest, because a very trivial modification to the methodology could achieve the exact same thing while following the original papers requirements to the letter.
> It’s a weak thought experiment
Wrong. It's a perfect demonstration of one of the many reasons why AI research is all but ignoring the Turing Test; the fact that the test is more about the interrogator than it is about the machine.
> “bad faith engagement.”
I don't agree with your statements, and have presented arguments why, that's not argueing in bad faith.
Your bad faith comes from how you disagreed with my statements; you did not do the necessary due diligence to demonstrate I should continue putting forward additional effort in both understanding your point and respecting your ideas.
For example, you reply "Wrong." to a subjective evaluation I've made. It literally cannot be wrong (though you can disagree), yet you declare it so with confidence! That's bad faith, and it means I will not engage further.
> you did not do the necessary due diligence to demonstrate
I did all the necessary due dilligence. I was perfectly aware that the paper used a variation on the imitation game. I also read Turings paper long before this discussion started (I think I was in highschool when I first stumbled upon it).
That's how I knew that it is easy to come up with basically the same thought experiment, without even changing any of the games rules.
> For example, you reply "Wrong." to a subjective evaluation I've made.
Because in my subjective evaluation it isn't a weak thought experiment, so I am fully within my rights to disagree with your evaluation.
I always saw the Turing test as kind of comparable to the Bechdel test. Updating it misses the entire point: there's a super simple and easily applicable thresh hold that, at least at the time it was established, a layman could use to soundly demolish most any example they came across.
It was never supposed to be an end-all-be-all, it was supposed to be a quick and dirty way of eliminating possibilities.
People are getting burned by interactions with bad bots and actors on the internet. Whereas in the past people had no experience with these digital personas, they likely put far more assumption in that a persona was human. As time goes on it seems likely that we'll assume that the actor is digital. Especially in cases where we're dealing with people from other cultures that don't share the same closely held values, we'll assume they are not real.
No GPT-4 doesn't not pass. In particular it's obvious when it's asked a question that it doesn't know the answer too it then spews out volumes of almost tangential statements rather than doing what (most) humans would do and say "I don't know". That's the tell.
1. This can be seen by comparing the fiction writing ability of the GPT-3.5 base model with the fine-tuned ChatGPT-3.5 model: https://nostalgebraist.tumblr.com/post/706441900479152128/no...