
A sequence of characters is encoded into tokens: tokens are groups of characters, and each token is mapped to a vector representation. When you give text to an LLM, the text is encoded into tokens, and each token corresponds to an index into a vocabulary. Each index corresponds to one vector. The model produces a vector, finds the most similar vocabulary vector, and selects the corresponding index as the next token.
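A minimal sketch of that lookup-and-select loop, using a tiny made-up vocabulary with random embeddings (all names and sizes here are illustrative, not from any real model):

    import numpy as np

    # Toy vocabulary: token id -> embedding vector (random values, purely illustrative)
    vocab_size, d_model = 8, 4
    embedding = np.random.randn(vocab_size, d_model)

    # "Encoding" a prompt is just a list of token ids used to index that table
    prompt_ids = [3, 1, 5]
    prompt_vectors = embedding[prompt_ids]        # one vector per token

    # Pretend the model produced this output vector for the next position
    output_vector = prompt_vectors.mean(axis=0)   # stand-in for the real computation

    # Pick the vocabulary vector most similar to the output (dot-product similarity)
    scores = embedding @ output_vector
    next_token_id = int(np.argmax(scores))
    print("next token id:", next_token_id)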

This is a spectrum: you can write a model that works at the bit level (2 vectors), the byte level (256), pairs of bytes (2^16), and so on.
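For example, byte-level "tokenization" is just the raw bytes of the UTF-8 encoding, so the vocabulary has 256 entries:

    text = "héllo"
    byte_tokens = list(text.encode("utf-8"))   # each byte is a token id in [0, 255]
    print(byte_tokens)                          # [104, 195, 169, 108, 108, 111]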

These days, we use statistical approaches (like byte-pair encoding) to build the vocabulary, and a token can be 1, 2, 3, or N characters long.
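You can see the variable token lengths with OpenAI's tiktoken library (assuming it is installed; the encoding name below is the one used by recent GPT models):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("Tokenization is a spectrum")
    print(ids)
    print([enc.decode([i]) for i in ids])   # tokens span anywhere from 1 to many characters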

So when you give a sequence of characters to the model, it turns that into a sequence of tokens and loads a vector for each one, and when doing its computations it needs to consider all of those tokens together. The maximum number of tokens it can consider at once is called the context window.
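The "consider all tokens together" part is what self-attention does: every position scores every other position, so the score matrix is n x n. A toy numpy sketch (single head, no learned projections, purely illustrative):

    import numpy as np

    n_tokens, d_model = 6, 4
    x = np.random.randn(n_tokens, d_model)        # one vector per token in the context

    # Each token scores every other token: the matrix is n_tokens x n_tokens,
    # which is why cost grows quickly as the context window gets longer.
    scores = x @ x.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    out = weights @ x                              # each output mixes information from all tokens
    print(weights.shape)                           # (6, 6)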

In this case, scaling the number of tokens means scaling the context window to a large number.

GPT-3.5 can do 2Ki tokens iirc, OpenAI's GPT-4 can do 4Ki iirc, and Claude from Anthropic can do 1Mi iirc.

The context window is kinda analogous to your working memory: the bigger the better, unless there are approximations that trade off quality for length, which is what is happening here.
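One common family of such approximations (my own illustration, not necessarily what any particular model does) is restricting attention to a sliding window, so each token only looks at its most recent neighbours instead of the whole context:

    import numpy as np

    n_tokens, window = 8, 3

    # Causal sliding-window mask: token i may attend only to tokens j with i - window < j <= i
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    mask = (j <= i) & (i - j < window)
    print(mask.astype(int))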



The original GPT-3.5 can do 4k tokens, and there is a recent version with 16k tokens (gpt-3.5-turbo-16k).


Ahhh thanks for the correction! And iirc GPT-4 has a 32k version too.



