
It's true that ChatGPT is not designed for counting and struggles with it in general.

But my point was that ChatGPT, like any tokenized LLM, doesn't even have the concept of letters. The prompt "how many e's in this sentence" is rendered as the tokens [4919, 867, 304, 338, 287, 428, 6827]. There just isn't a pathway for it to consider the letters that make up those tokens.
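To make that concrete, here's a rough sketch using tiktoken, assuming a GPT-2-style BPE vocabulary (which is roughly what those IDs look like; the exact IDs depend on the model's tokenizer):

    # Rough sketch, assuming a GPT-2-style BPE tokenizer (tiktoken's "gpt2" encoding).
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    prompt = "how many e's in this sentence"
    ids = enc.encode(prompt)
    print(ids)  # a short list of integer token IDs

    # The model only ever sees these integers; the individual letters are gone.
    for token_id in ids:
        print(token_id, repr(enc.decode([token_id])))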

I'm a little surprised it did that well on your prompt, which is rendered as [10919, 2456, 886, 287, 556, 72]. The interesting thing here is that 556 = " ag" (with leading space) and 72 = "i". So I'm not sure how it got to those words. "Wagagi" is tokens [54, 363, 18013], so somehow it is seeing that token 18013 is what you get when you combine 556 and 72? That seems really weird.
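If anyone wants to poke at this themselves, you can decode the individual IDs (same GPT-2-style assumption as above, so treat the exact pieces as illustrative):

    # Inspect what text each token ID maps to, and how a rare word gets split.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for token_id in [10919, 2456, 886, 287, 556, 72]:
        print(token_id, repr(enc.decode([token_id])))

    print(enc.encode("Wagagi"))  # a rare word ends up as several sub-word pieces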

I'd love clarification from someone deeper into LLMs and tokenization.



This is an excellent question. I wonder if it's doing something like [1], working from letter composition rather than meaning.

[1] https://arxiv.org/pdf/1810.04882.pdf


In a prompt, can you just tell the model which letters make up each token? E.g., a list like "ag = a g", etc. I imagine a dictionary like that for all tokens in the training data would help.
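Something like this is easy to generate per prompt (hedged sketch; the prompt string is just a hypothetical example, and it assumes the same GPT-2-style tokenizer as above):

    # Build a "token -> letters" hint you could paste into the prompt as plain text.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    prompt = "how many e's in this sentence"  # hypothetical example prompt
    spelled = {}
    for token_id in enc.encode(prompt):
        piece = enc.decode([token_id])
        spelled[piece] = " ".join(piece.strip())

    print(spelled)  # e.g. {' many': 'm a n y', ...} -- the model sees this as text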


Maybe? Individual letters are tokens, so you could say something like 3128 = 56 + 129, but the problem is that 3128 is processed as text, not as the integer token ID. So the tokenizer would turn 3128 into a series of tokens.

Intuitively I think there's an abstraction barrier there, but I'm not positive. It feels like asking us to list all of the words that trigger particular neurons.
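The "processed as text" part is easy to see if you tokenize the mapping itself (same hedged GPT-2-style assumption):

    # The string "3128" is itself re-tokenized, so the model never sees the
    # integer token ID directly -- only whatever pieces the digits break into.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    print(enc.encode("3128"))             # the digits become one or more tokens
    print(enc.encode("3128 = 56 + 129"))  # the whole mapping is just more text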



