
You don't even have to get deep into the internals of LLMs to see what's wrong with your reasoning. The problem lies with the basic mechanics of:

> predict the most objectively correct next word in a sequence of words

Currently all LLMs only determine the most probable next token, which means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence. In practice, a great many very likely sentences are composed of fairly unlikely words. When we use the output of an LLM we think of it as a sequence sampled from the set of all possible sequences, but that's not really what we're getting (at least as far as probability is concerned).
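A toy example makes the gap concrete. All the numbers below are made up purely to illustrate the point: picking the argmax token at each step need not yield the argmax sequence.

```python
from itertools import product

# Hypothetical two-step next-token model: P(next token | prefix).
cond = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

# Greedy decoding: most probable token at every step.
seq = ()
while seq in cond:
    dist = cond[seq]
    seq += (max(dist, key=dist.get),)
print("greedy:", seq)  # ('a', 'x'), P = 0.6 * 0.55 = 0.33

# Exhaustive search: most probable full sequence.
def joint(s):
    return cond[()][s[0]] * cond[(s[0],)][s[1]]

best = max(product("ab", "xy"), key=joint)
print("best:", best)  # ('b', 'x'), P = 0.4 * 0.9 = 0.36
```

Greedy commits to "a" because it is locally the best first token, and thereby locks itself out of the globally most probable sequence starting with the less likely "b".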

There are approaches to address this: you can do multinomial sampling instead of greedy decoding, so that you're casting a slightly larger net, or you can do beam search, where you're once again searching a broader set of possible sentences and choosing the most probable sequence. But all of these are fairly limited.
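The two decoding strategies in miniature (the logits are invented, and `greedy`/`multinomial` are just illustrative helper names, not any library's API):

```python
import math
import random

# Made-up logits for the next token; softmax turns them into probabilities.
logits = {"the": 2.0, "a": 1.5, "cat": 0.2}
z = sum(math.exp(v) for v in logits.values())
dist = {tok: math.exp(v) / z for tok, v in logits.items()}

def greedy(dist):
    # Deterministic: always the single most probable token.
    return max(dist, key=dist.get)

def multinomial(dist, rng=random):
    # Stochastic: sample in proportion to probability -- a wider net.
    toks, weights = zip(*dist.items())
    return rng.choices(toks, weights=weights, k=1)[0]

print(greedy(dist), multinomial(dist))
```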

Which gets to your first remark:

> Now explain what humans are doing and why it is different.

There's very little we really know about how humans reason, but we are certainly building linguistic expressions with a more abstract form of composition. This comment, for example, was planned out in parts, not even sequentially, and then reworked so the whole thing makes some sense. But at the very least, humans are clearly reasoning at the level of entire sequences and their probability, rather than one token at a time.

The word "planning" almost tautologically implies thinking ahead of the next step. When humans write HN comments or code they're clearly planning rather than just thinking of the next most likely word over and over again with some noise to make it sound more interesting. No matter how powerful and sophisticated the mathematical models driving the core of LLMs are, we're fundamentally limited by the methods we use to sample from them.



In order to produce the next sentence you have to produce the next word first and then the word after it and so on.

Before the model arrives at candidates for the next word, it first computes vectors in a high-dimensional space that combine every combination of words in the context and extract semantics from them. When producing the next token, the model effectively has already "decided" the direction the answer will go, and that is encoded as a high-dimensional vector before being reduced to the next token (and the process repeated).
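Schematically, the last step looks something like this (shapes and weights are invented; real models differ in many details):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

hidden = rng.normal(size=d_model)              # dense summary of the whole context
W_unembed = rng.normal(size=(vocab, d_model))  # projection to vocabulary scores

logits = W_unembed @ hidden                    # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary

next_token = int(probs.argmax())               # only now collapse to one token
```

The rich vector `hidden` exists before any token is chosen; the emitted token is a lossy projection of it.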


> When producing the next token the model effectively has already "decided" the direction where the answer will go

No it hasn't. If you tell it to write a random story and it starts with "A", it hasn't figured out what the next word should be; if you run it many times from that "A" you will get many different sentences.

It will do some adapting to future possibilities, but it doesn't calculate the sentence once as you suggest; it effectively comes up with a new sentence for every token it generates.


If you "tell it" to make up a random story, then your prompt is part of the context, and if the model is fine-tuned to follow instructions it will emit a story that sounds like a random story that begins with "A" (since that's a constraint).

If instead the prompt says it should emit 5 repetitions of the letter "A", unsurprisingly it will complete the output with " A A A A".

The task is performed by emitting tokens, but in order to correctly execute the task the model has to "understand" the prompt sufficiently well to choose the next token (and the next, etc.).

Now, obviously current LLMs have severe deficiencies in their ability to model the real world (which is revealed by their failures to handle common-sense scenarios). This problem is compounded by a psychological factor: we humans tend to ascribe more "intelligence" to an agent that "speaks well", so the dissonance of a model that sounds intelligent and yet is sometimes so hilariously stupid throws us off the rails.

But there is clearly some modeling and processing going on. It's not a mere stochastic parrot. We have those (Markov chains of various sorts) and they cannot maintain a coherent text for long. LLMs OTOH are objectively a phase transition in that space.

All I'm trying to say is that whatever is lacking in LLMs is not merely because they "just do next token prediction".

There are other things these models should do in order to reach the next level of reasoning. It's not clear whether that can be achieved just by training on more and more data (hoping the models learn the trick by themselves) or whether we need to improve the architecture to enable the next phase.


That doesn't hold together. You seem to be arguing that LLMs produce text as a sequence of words. Which, fair enough, they obviously do.

But then your argument seems to drift into humans not producing text as a series of words. I'm not sure how you type your comments but you should upload a YouTube video of it as it sounds like it'd be quite a spectacle!

If your argument is that LLMs can't reason because they don't edit their comments, it'd be worth stopping and reflecting for a few moments about how weak a position that is. I wrote this comment linearly just to make a point with no editing except spellchecking.


Humans don't really generate text as a series of words. If you've ever known what you wanted to say but not been able to remember the word you can see this in practice. Although the analogy is probably a helpful one, LLMs are basically doing the word remembering bit of language, without any of the thought behind it.


How do you generate your text? Do you write the middle of the sentence first, come back to the start then finish it? Or do you have a special keyboard where you drop sentences as fully formed input?

As systems humans and LLMs behave in observably similar ways. You feed in some sort of prompt+context, there is a little bit of thinking done, a response is developed by some wildly black-box method, and then a series of words are generated as output. The major difference is that the black boxes presumably work differently but since they are both black boxes that doesn't matter much for which will do a better job at root cause analysis.

People seem to go a bit crazy on this topic at the idea that complex systems can be built from primitives. Just because the LLM primitives are simple doesn't mean the overall model isn't capable of complex responses.


    Do you write the middle of the sentence first, come back to the start then finish it?
Am I the only one that does this?

I'll have a central point I want to make that I jot down and then come back and fill in the text around it -- both before and after.

When writing long form, I'll block out whole sections and build up an outline before starting to fill it in. This approach allows better distribution on "points of interest" (and was how I was taught to write in the 90's).


> Currently all LLMs are only determining the most probable next token, but this means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

They're normally trained to output a probability distribution for the next token and _sample_ from that distribution. Doing so iteratively, if you work through the conditional probabilities, samples from the distribution of completed prompts (or similarly if you want to stop at a single sentence) with the same distribution as the base training data.
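This is easy to verify empirically on a toy model (conditionals invented): sampling one token at a time from P(next | prefix) reproduces the joint distribution over whole sequences, exactly as the chain rule says.

```python
import random
from collections import Counter

# Toy conditional model: P(first token), then P(second token | first).
cond = {
    (): {"a": 0.7, "b": 0.3},
    ("a",): {"x": 0.2, "y": 0.8},
    ("b",): {"x": 0.5, "y": 0.5},
}

def sample_sequence(rng):
    # Sample one token at a time from the conditionals, LLM-style.
    seq = ()
    while seq in cond:
        toks, weights = zip(*cond[seq].items())
        seq += (rng.choices(toks, weights=weights, k=1)[0],)
    return seq

rng = random.Random(0)
n = 100_000
counts = Counter(sample_sequence(rng) for _ in range(n))

# Chain rule: P(("a", "y")) = 0.7 * 0.8 = 0.56; the empirical rate agrees.
print(counts[("a", "y")] / n)
```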

You're right that you can't pick the most likely sentence in general, but if there exists a sentence likely enough for you to care then you can just repeat the prompt a few times and take the most common output, adjusting the repetition count in line with your desired probability of failure. Most prompts don't have a "most likely" sentence for you to care about though. If you ask for meal suggestions with some context, you almost certainly want a different response each time, and the thing that matters is that the distribution of those responses is "good." LLMs, by design, can accomplish that so long as the training data has enough information and the task requires at most a small, bounded amount of computation.
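The repeat-and-vote idea in miniature (`ask_model` is a stand-in for a sampled LLM call, with an invented answer distribution):

```python
import random
from collections import Counter

def ask_model(rng):
    # Stand-in for one sampled LLM response: right answer 60% of the time.
    return rng.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2], k=1)[0]

rng = random.Random(1)
votes = Counter(ask_model(rng) for _ in range(101))
answer, count = votes.most_common(1)[0]
print(answer, count)
```

With a dominant answer at 60%, the probability that the majority vote is wrong shrinks exponentially as the repetition count grows.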


>> That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

Beam search.
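For the curious, a minimal beam search over a toy model (numbers invented): keep the k most probable partial sequences instead of only the single greedy one.

```python
# Toy conditionals, chosen so greedy decoding yields ('a', 'x') with
# P = 0.33, while the most probable full sequence is ('b', 'x'), P = 0.36.
cond = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(k=2):
    beams = [((), 1.0)]  # (sequence, probability)
    while any(seq in cond for seq, _ in beams):
        expanded = []
        for seq, p in beams:
            if seq in cond:  # still extendable
                for tok, q in cond[seq].items():
                    expanded.append((seq + (tok,), p * q))
            else:            # already complete
                expanded.append((seq, p))
        # Keep only the k most probable (partial) sequences.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
    return beams

print(beam_search())  # top beam is ('b', 'x') -- greedy would miss it
```

With k=1 this degenerates to greedy decoding; larger k trades compute for a better approximation of the argmax sequence.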



