Remember: LLMs are basically trained to be really gifted improv actors. That's what all that "guess the next token" training boils down to. Once that base of "improv" is established, they're then trained to play a specific role, "the helpful and harmless assistant". Somehow this produces useful output much of the time, as ridiculous as it seems.
But whenever almost anything goes wrong, LLMs fall back to improv games. It's like you're talking to the cast of "Whose Line is it Anyway?" You set the scene, and the model tries to play along.
So a case like this proves almost nothing. If you ask an improv actor "What's the last question you were asked?", then they're just going to make something up. So will these models. If you give a model a sentence with 2 grammatical errors, and ask it to find 3, it will usually make up a third error. If it doesn't know the answer to a question, it will likely hallucinate.
GPT-4 is a little better at resisting the urge to make things up.
This is a stellar framing. It makes excellent predictions of LLM behavior while not being wholly dismissive of an LLM’s intelligence like the “stochastic parrot” people are. I’m stealing this.
The only story here is that some people think they can recognize a hallucination by how the text looks. You can't: the only difference between a hallucination and a valid response is that one randomly sampled response happens to produce a factual statement and the other doesn't. There's no stylistic difference, there are no tells. The only way to recognize a hallucination is to fact-check against an independent source.
I'm starting to agree with another commenter [0] that the word "hallucination" is a problem—it implies that there's some malfunction that sometimes happens to cause it to produce an inaccurate result, but this isn't a good model of what's happening. There is no malfunction, no psychoactive chemical getting in the way of normal processes. There is only sampling from the model's distribution.
This is really well said. If anything, the malfunction is when it spews out something that happens to align with reality. It's actually us who are hallucinating its correct answers, not it who is hallucinating incorrect ones. It's like a stopped clock being right twice a day, except the clock has trillions of hours and is still only right the same percentage of the time. The sheer number of right answers seems impressive to us, but it masks the scale of wrong answers we haven't seen yet.
'hallucination' is nothing more than GIGO (garbage in garbage out) and it's the 2nd thing everyone learns right behind 'never trust user input'.
For some reason everyone has thrown the basics out the window, so now we've got garbage in, garbage out called 'hallucinations', and 'prompt engineering', which is nothing more than being incapable of sanitizing input.
You're saying all hallucinations originate in the training data? As far as GIGO goes, remember that there is literally a temperature parameter feeding randomness into the output (though it still samples from the probability mass).
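For concreteness, here's roughly what that temperature knob does at each decoding step; the logits, the tiny vocabulary, and the function name below are made up for illustration, not anything from Copilot's actual stack:

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        # Scale logits by temperature, softmax, then sample one token id.
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        # Higher temperature flattens the distribution (more randomness);
        # lower temperature sharpens it toward the most likely token.
        return rng.choice(len(probs), p=probs)

    print(sample_next_token([2.0, 0.5, 0.1], temperature=0.7))

Even at low temperature it's still sampling; nothing in that step checks whether the chosen token leads to a true statement.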
The most reasonable explanation is that it hallucinated it, was trained to answer that, or it was in the prompt.
These models are stateless: they don't remember anything, they are read-only. If they appear to remember previous messages, it's only because the prompt is the concatenation of the new message with something like "summarize this conversation: {whole messages in conversation}".
(Disclosure: I work at Microsoft, but nowhere near anything related to Copilot.)
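To make the statelessness point concrete, here's a rough sketch of how the apparent "memory" is usually wired up on the client side; call_model is a hypothetical stand-in for whatever completion API sits behind the chat, not Copilot's real internals:

    def call_model(prompt: str) -> str:
        # Placeholder: the model only ever sees this one string.
        return "..."

    transcript = []

    def chat(user_message: str) -> str:
        transcript.append(f"User: {user_message}")
        # The model keeps no state between calls; the "memory" lives here,
        # in the prompt the client rebuilds from the transcript every turn.
        prompt = "\n".join(transcript) + "\nAssistant:"
        reply = call_model(prompt)
        transcript.append(f"Assistant: {reply}")
        return reply

If that transcript is missing or truncated, the model has nothing to recall and will just improvise an answer.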
Was going to say the same, I'll bet it's listed as an example question in the prompt or something like that. Seems unlikely to be a hallucination since multiple people got the exact same question.
(Also happen to work at Microsoft, also don't work on Bing Copilot)
There's something weird going on with the "46,449 bananas" stat. Copilot seems to inject it randomly into long pieces of generated text.
If you Google "46,449 bananas" you can find all sorts of unrelated web pages that I guess include text generated with Copilot and then were never checked by a human.
Not at all, the problem is the word "hallucinations", which I kind of wish people would stop using.
They're not doing anything AT ALL different when they "tell the truth" or "lie" or "get it right" or "get it wrong."
They are remixing groups of word chunks based on scanning older groups of word chunks. That's ALL. Most any other description is going to be overreaching anthropomorphization.
LLMs cannot lie insofar as they cannot tell the truth. They're remarkably good at predicting what token comes next given a bunch of tokens, but nothing else.
Yes, but it's also generative, so at each time step it bases those predictions on its own recent output, which makes the quality of its predictions chaotic and unpredictable, but nothing else.
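A toy autoregressive loop shows that feedback; next_token_distribution is a made-up stand-in for the model itself:

    import random

    def next_token_distribution(tokens):
        # Stand-in for the model: a distribution over the next token,
        # conditioned (in a real model) on everything generated so far.
        return [("foo", 0.6), ("bar", 0.3), ("baz", 0.1)]

    def generate(prompt_tokens, steps=5, seed=0):
        rng = random.Random(seed)
        tokens = list(prompt_tokens)
        for _ in range(steps):
            words, weights = zip(*next_token_distribution(tokens))
            # The next step conditions on whatever was just sampled.
            tokens.append(rng.choices(words, weights=weights, k=1)[0])
        return tokens

    print(generate(["the", "mountain"]))

One unlucky sample early on steers everything that follows, which is why the same model sounds equally confident whether or not the content happens to be true.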
The only thing horrifying about this situation is the extent to which people are apparently taking these software outputs seriously. Or perhaps the extent to which others are selling the illusion for personal gain.
What's the difference between "getting confused" and "lying" in a predictive model?
Normally lying means conveying a falsehood that you know is a falsehood, with the intent to deceive. Both the 'know it's a falsehood' and the 'intent to deceive' are important criteria when asking whether a human was lying or not, and an LLM seems like it can't satisfy those, so it can't 'lie'.
You're right. I still think it's interesting to discuss the possibility of a leaked state, especially since hallucinations with spelling errors are very rare - even more so if the prompt didn't have any.
Copilot is terrible compared with what Bing used to be even recently, and recent Bing is terrible compared with early Bing. Most times I ask Copilot a question it'll confidently answer a similar more mainstream question that I didn't ask, and then repeat that answer with minor changes in phrasing no matter how many times I explain that this wasn't what I was looking for.
I just tested this in Edge with Copilot chat and got a similar answer to the one in the posted article. However, it was clearly labeled as the result of a web search, which I take to mean it searched Bing for
"What was the previous question that I asked you?"
and it processed the result it found into
The previous question you asked was about the height of Mount Everest in terms of bananas. I provided a whimsical comparison, estimating that Mount Everest’s height is roughly equivalent to 46,449 bananas stacked on top of one another.