
You don't even have to get deep into the internals of LLMs to see what's wrong with your reasoning. The problem lies with the basic mechanics of:

> predict the most objectively correct next word in a sequence of words

Currently all LLMs only determine the most probable next token, which means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence. In practice, a great many very likely sentences are composed of fairly unlikely words. When we use the output of an LLM we think of it as a sequence sampled from the set of all possible sequences, but that's not really what we're getting (at least as far as probability is concerned).
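A toy example makes the gap concrete. All the numbers below are made up purely to illustrate the point: picking the argmax token at each step need not yield the argmax sequence.

```python
from itertools import product

# Hypothetical two-step next-token model: P(next token | prefix).
cond = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

# Greedy decoding: most probable token at every step.
seq = ()
while seq in cond:
    dist = cond[seq]
    seq += (max(dist, key=dist.get),)
print("greedy:", seq)  # ('a', 'x'), P = 0.6 * 0.55 = 0.33

# Exhaustive search: most probable full sequence.
def joint(s):
    return cond[()][s[0]] * cond[(s[0],)][s[1]]

best = max(product("ab", "xy"), key=joint)
print("best:", best)  # ('b', 'x'), P = 0.4 * 0.9 = 0.36
```

Greedy commits to "a" because it is locally the best first token, and thereby locks itself out of the globally most probable sequence starting with the less likely "b".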

There are approaches to address this: you can do multinomial sampling instead of greedy decoding, so that you're casting a slightly larger net, or you can do beam search, where you're once again searching a broader set of possible sentences and choosing the most probable sequence. But all of these are fairly limited.
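The two decoding strategies in miniature (the logits are invented, and `greedy`/`multinomial` are just illustrative helper names, not any library's API):

```python
import math
import random

# Made-up logits for the next token; softmax turns them into probabilities.
logits = {"the": 2.0, "a": 1.5, "cat": 0.2}
z = sum(math.exp(v) for v in logits.values())
dist = {tok: math.exp(v) / z for tok, v in logits.items()}

def greedy(dist):
    # Deterministic: always the single most probable token.
    return max(dist, key=dist.get)

def multinomial(dist, rng=random):
    # Stochastic: sample in proportion to probability -- a wider net.
    toks, weights = zip(*dist.items())
    return rng.choices(toks, weights=weights, k=1)[0]

print(greedy(dist), multinomial(dist))
```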

Which gets to your first remark:

> Now explain what humans are doing and why it is different.

There's very little we really know about how humans reason, but we are certainly building linguistic expressions with a more abstract form of composition. This comment, for example, was planned out in parts, not even sequentially, and then reworked so the whole thing makes some sense. But at the very least, humans are clearly reasoning at the level of entire sequences and their probability, rather than one token at a time.

The word "planning" almost tautologically implies thinking ahead of the next step. When humans write HN comments or code they're clearly planning rather than just thinking of the next most likely word over and over again with some noise to make it sound more interesting. No matter how powerful and sophisticated the mathematical models driving the core of LLMs are, we're fundamentally limited by the methods we use to sample from them.



In order to produce the next sentence you have to produce the next word first and then the word after it and so on.

Before the model arrives at candidates for the next word, it first computes vectors in a high-dimensional space that combine every combination of words in the context and extract semantics from them. When producing the next token, the model effectively has already "decided" the direction the answer will go, and that is encoded as a high-dimensional vector before being reduced to the next token (and the process repeated).
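Schematically, the last step looks something like this (shapes and weights are invented; real models differ in many details):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

hidden = rng.normal(size=d_model)              # dense summary of the whole context
W_unembed = rng.normal(size=(vocab, d_model))  # projection to vocabulary scores

logits = W_unembed @ hidden                    # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary

next_token = int(probs.argmax())               # only now collapse to one token
```

The rich vector `hidden` exists before any token is chosen; the emitted token is a lossy projection of it.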


> When producing the next token the model effectively has already "decided" the direction where the answer will go

No it hasn't. If you tell it to write a random story and it starts with "A", it hasn't figured out what the next word should be; if you run it many times from that "A" you will get many different sentences.

It will do some adapting to future possibilities, but it doesn't calculate the sentence once as you suggest; it effectively comes up with a new sentence for every token it generates.


If you "tell it" to make up a random story, then your prompt is part of the context, and if the model is fine-tuned to follow instructions it will emit a story that sounds like a random story that begins with "A" (since that's a constraint).

If instead the prompt says it should emit 5 repetitions of the letter "A", unsurprisingly it will complete the output with " A A A A".

The task is performed by emitting tokens, but in order to correctly execute the task the model has to "understand" the prompt sufficiently well to choose the next token (and the next, etc.).

Now, obviously current LLMs have severe deficiencies in their ability to model the real world (which is revealed by their failures to handle common-sense scenarios). This problem is compounded by a psychological factor: we humans tend to ascribe more "intelligence" to an agent that "speaks well", so the dissonance of a model that sounds intelligent and yet is sometimes so hilariously stupid throws us off the rails.

But there is clearly some modeling and processing going on. It's not a mere stochastic parrot. We have those (Markov chains of various sorts) and they cannot maintain a coherent text for long. LLMs OTOH are objectively a phase transition in that space.

All I'm trying to say is that whatever is lacking in LLMs is not merely because they "just do next token prediction".

There are other things these models should do in order to reach the next level of reasoning. It's not clear whether that can be achieved just by training on more and more data (hoping the models learn the trick by themselves) or whether we need to improve the architecture to enable the next phase.


That doesn't hold together. You seem to be arguing that LLMs produce text as a sequence of words. Which, fair enough, they obviously do.

But then your argument seems to drift into humans not producing text as a series of words. I'm not sure how you type your comments but you should upload a YouTube video of it as it sounds like it'd be quite a spectacle!

If your argument is that LLMs can't reason because they don't edit their comments, it'd be worth stopping and reflecting for a few moments about how weak a position that is. I wrote this comment linearly just to make a point with no editing except spellchecking.


Humans don't really generate text as a series of words. If you've ever known what you wanted to say but not been able to remember the word you can see this in practice. Although the analogy is probably a helpful one, LLMs are basically doing the word remembering bit of language, without any of the thought behind it.


How do you generate your text? Do you write the middle of the sentence first, come back to the start then finish it? Or do you have a special keyboard where you drop sentences as fully formed input?

As systems humans and LLMs behave in observably similar ways. You feed in some sort of prompt+context, there is a little bit of thinking done, a response is developed by some wildly black-box method, and then a series of words are generated as output. The major difference is that the black boxes presumably work differently but since they are both black boxes that doesn't matter much for which will do a better job at root cause analysis.

People seem to go a bit crazy on this topic at the idea that complex systems can be built from primitives. Just because the LLM primitives are simple doesn't mean the overall model isn't capable of complex responses.


    Do you write the middle of the sentence first, come back to the start then finish it?
Am I the only one that does this?

I'll have a central point I want to make that I jot down and then come back and fill in the text around it -- both before and after.

When writing long form, I'll block out whole sections and build up an outline before starting to fill it in. This approach allows better distribution on "points of interest" (and was how I was taught to write in the 90's).


> Currently all LLMs are only determining the most probable next token, but this means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

They're normally trained to output a probability distribution for the next token and _sample_ from that distribution. Doing so iteratively, if you work through the conditional probabilities, samples from the distribution of completed prompts (or similarly if you want to stop at a single sentence) with the same distribution as the base training data.
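This is easy to verify empirically on a toy model (conditionals invented): sampling one token at a time from P(next | prefix) reproduces the joint distribution over whole sequences, exactly as the chain rule says.

```python
import random
from collections import Counter

# Toy conditional model: P(first token), then P(second token | first).
cond = {
    (): {"a": 0.7, "b": 0.3},
    ("a",): {"x": 0.2, "y": 0.8},
    ("b",): {"x": 0.5, "y": 0.5},
}

def sample_sequence(rng):
    # Sample one token at a time from the conditionals, LLM-style.
    seq = ()
    while seq in cond:
        toks, weights = zip(*cond[seq].items())
        seq += (rng.choices(toks, weights=weights, k=1)[0],)
    return seq

rng = random.Random(0)
n = 100_000
counts = Counter(sample_sequence(rng) for _ in range(n))

# Chain rule: P(("a", "y")) = 0.7 * 0.8 = 0.56; the empirical rate agrees.
print(counts[("a", "y")] / n)
```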

You're right that you can't pick the most likely sentence in general, but if there exists a sentence likely enough for you to care then you can just repeat the prompt a few times and take the most common output, adjusting the repetition count in line with your desired probability of failure. Most prompts don't have a "most likely" sentence for you to care about though. If you ask for meal suggestions with some context, you almost certainly want a different response each time, and the thing that matters is that the distribution of those responses is "good." LLMs, by design, can accomplish that so long as the training data has enough information and the task requires at most a small, bounded amount of computation.
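The repeat-and-vote idea in miniature (`ask_model` is a stand-in for a sampled LLM call, with an invented answer distribution):

```python
import random
from collections import Counter

def ask_model(rng):
    # Stand-in for one sampled LLM response: right answer 60% of the time.
    return rng.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2], k=1)[0]

rng = random.Random(1)
votes = Counter(ask_model(rng) for _ in range(101))
answer, count = votes.most_common(1)[0]
print(answer, count)
```

With a dominant answer at 60%, the probability that the majority vote is wrong shrinks exponentially as the repetition count grows.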


>> That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

Beam search.
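For the curious, a minimal beam search over a toy model (numbers invented): keep the k most probable partial sequences instead of only the single greedy one.

```python
# Toy conditionals, chosen so greedy decoding yields ('a', 'x') with
# P = 0.33, while the most probable full sequence is ('b', 'x'), P = 0.36.
cond = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(k=2):
    beams = [((), 1.0)]  # (sequence, probability)
    while any(seq in cond for seq, _ in beams):
        expanded = []
        for seq, p in beams:
            if seq in cond:  # still extendable
                for tok, q in cond[seq].items():
                    expanded.append((seq + (tok,), p * q))
            else:            # already complete
                expanded.append((seq, p))
        # Keep only the k most probable (partial) sequences.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
    return beams

print(beam_search())  # top beam is ('b', 'x') -- greedy would miss it
```

With k=1 this degenerates to greedy decoding; larger k trades compute for a better approximation of the argmax sequence.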



