But that is because OpenAI's goal wasn't to pass the Turing test.
The most obvious sign of this is that, if you ask, ChatGPT readily tells you, with no deception, that it is a large language model.
If they had wanted to pass the Turing test, they would have chosen a specific personality and done the whole RLHF process with that personality in mind. For example, they would have picked George, a 47-year-old English teacher who knows a lot about poems and novels, has stories about kids misbehaving, but says he has no idea if you ask him about engine maintenance.
Instead, what OpenAI wanted is a universal expert who knows everything about everything, so it is no surprise that it overreaches at the boundaries of its knowledge.
In other words the limitation you talk about is not inherent in the technology, but in their choices.
>In other words the limitation you talk about is not inherent in the technology, but in their choices.
I think it's somewhat inherent in the technology. At its core, an LLM is still trying to guess the next word / sentence / paragraph in a statistical manner.
Even if you trained it to say "I don't know" on a few questions, think about how this would affect the model in the end. There's usually no good correlation to be found between "I don't know" and the input words. At most you could get it to say "I don't know" about obscure stuff every once in a while, because for obscure topics that's a somewhat more likely answer than it is for common knowledge.
Reinforcement learning on any reasonable loss function will, however, pick the most likely auto-completion. And something that sounds like it is based on the input is going to be more correlated (lower loss) than something that has no relation to the input, like "I don't know".
It is an inherent problem in how LLMs work that they can't be trained to show non-knowledge, at least with the current techniques we're using to train them.
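A rough sketch of what I mean, with GPT-2 via Hugging Face transformers standing in for the base model (the question and both candidate answers are made up): score two continuations by the average per-token loss the model assigns them. The point is only that the model ranks continuations by statistical fit, not by truth.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def continuation_loss(prompt: str, continuation: str) -> float:
        """Average cross-entropy per token of `continuation` given `prompt`;
        lower loss means the model finds the continuation more likely."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Each continuation token is predicted from the position before it.
        cont_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
        log_probs = torch.log_softmax(cont_logits, dim=-1)
        idx = torch.arange(cont_ids.shape[1])
        token_log_probs = log_probs[idx, cont_ids[0]]
        return -token_log_probs.mean().item()

    # The novel is invented, so no answer can actually be grounded in the prompt.
    prompt = "Q: Who wrote the 1952 novel 'The Glass Meridian'?\nA:"
    print(continuation_loss(prompt, " It was written by a Russian novelist."))
    print(continuation_loss(prompt, " I don't know."))

Whichever continuation scores lower loss is what training pushes the model towards; nothing in that objective rewards the honest non-answer over the plausible-sounding one.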
This is also why it's hard to tell DALL·E 3 what shouldn't be in the picture, like the famous "no cheese" on the hamburger problem. Hamburgers and cheeseburgers are somewhat correlated: the first image it spat out for "hamburger" was a cheeseburger, and saying "no cheese" only added more emphasis to the correlation between cheese and the output, so the cheese never went away.
Any word you use that shouldn't be in there causes it to look for correlations to that word. It's, again, an inherent problem in the technology.
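Toy version of that point (made-up prompts, and a crude bag-of-words instead of real learned embeddings, but the "negation is just another token" issue is the same):

    from collections import Counter

    def bag_of_words(prompt: str) -> Counter:
        return Counter(prompt.lower().split())

    def cheese_signal(prompt: str) -> int:
        # How strongly a naive correlation with the token "cheese" would fire.
        return bag_of_words(prompt)["cheese"]

    print(cheese_signal("a photo of a hamburger"))                 # 0
    print(cheese_signal("a photo of a burger with extra cheese"))  # 1
    print(cheese_signal("a photo of a hamburger with no cheese"))  # 1 -- "no" doesn't cancel it

Mentioning cheese at all, even to forbid it, adds to the signal; a purely correlational conditioning has no mechanism for "no" to subtract it.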
Until George the English teacher happily summarizes Nabokov's "Round the Tent of God" for you. Hallucinations are a problem inherent in the technology.
You're conflating the limitations of a particular publicly deployed version of a specific model with the tech as a whole. Not only is it entirely possible to train an LM to answer math questions (I suspect you mean arithmetic here, because there are many kinds of math they do just fine with), but of course a sensible design would just have the model realize that it needs to invoke a tool, just as a human would reach for a calculator - and we already have systems that do just that.
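A rough sketch of that pattern (the CALC(...) protocol and the ask_model() stub are invented for illustration; real systems use structured function calling or ReAct-style loops): the harness watches for a tool request, runs the calculator, and uses its result instead of trusting the model's own arithmetic.

    import ast
    import operator
    import re

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculator(expr: str) -> float:
        """Evaluate a plain arithmetic expression without using eval()."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval"))

    def ask_model(prompt: str) -> str:
        # Stand-in for a real LLM call; pretend it decided to use the tool.
        return "CALC(1234 * 5678)"

    def answer(question: str) -> str:
        reply = ask_model(question)
        match = re.fullmatch(r"CALC\((.+)\)", reply.strip())
        if match:  # the model asked for the calculator
            result = calculator(match.group(1))
            # A real harness would feed the result back to the model for a
            # final natural-language answer; here we just report it.
            return f"(via calculator) {result}"
        return reply  # the model answered directly

    print(answer("What is 1234 * 5678?"))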
As for saying "I have no idea about ...", I've seen that many times even with ChatGPT. It is biased towards claiming it knows even when it doesn't, so maybe if you measured the probability you'd be able to use that as a metric - but then, we all know people who do stuff like that too, so how reliable is it really?
Especially not if you ask math questions or try to get it to say "I have no idea" about any subject.