The limitation comes from the size of the word position embedding matrix. It isn't a config issue or an API limitation: it's the size of a matrix that is part of the model, fixed before training. You can't change it.
What does that mean?
For each token in your input (or in the generated output), the model needs some representation of what that token's position means.
So there is a word position embedding matrix that contains one vector per position. The matrix has "only" 1024 entries for GPT-2, or 4096 for GPT-3. The size of each entry varies as well: the vectors range from 768 dimensions for GPT-2 small up to 12,288 for GPT-3.
So the WPE (word position embedding) matrix for GPT-2 is 1024×768, and for GPT-3 it is 4096×12,288.
Inference requires a vector from this matrix to be added to the word token embedding for each token in the original prompt, plus each generated token.
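A tiny sketch of why the table size is a hard limit; all sizes here are made-up toy numbers (real GPT-2 small would be 1024 positions by 768 dimensions):

```python
import random

# Toy illustration with made-up sizes: the prompt length is capped by
# the number of rows in the position embedding table.
n_ctx, d_model, vocab = 8, 4, 16  # tiny toy sizes, not the real ones

random.seed(0)
wte = [[random.random() for _ in range(d_model)] for _ in range(vocab)]   # token embedding table
wpe = [[random.random() for _ in range(d_model)] for _ in range(n_ctx)]   # position embedding table

def embed(token_ids):
    # Input vector at each slot = token embedding + position embedding.
    # A prompt longer than n_ctx has no wpe row left to look up.
    if len(token_ids) > n_ctx:
        raise ValueError(f"{len(token_ids)} tokens exceeds n_ctx={n_ctx}")
    return [[t + p for t, p in zip(wte[tok], wpe[pos])]
            for pos, tok in enumerate(token_ids)]

x = embed([3, 1, 4, 1, 5])  # 5 tokens -> 5 vectors of dimension d_model
```

Once the prompt plus generated tokens exceed the table's row count, there is simply no learned vector for position n+1.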
Positional embeddings are tricky: it very much depends on the specific embedding method chosen. Some advanced methods preserve, or even slightly improve, performance when the context length is increased beyond what was used in the main pretraining run.
As is often the case with these large models, you can change it with some finetuning on longer-context samples from the same dataset, for what is really a small amount of compute compared to the million hours spent training the thing.
You get this issue even without position embeddings. Attention computes an inner product between every pair of input tokens, so it scales as N^2 x E, and squares grow really fast.
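A back-of-the-envelope for that N^2 term; this counts bytes for one fp32 attention score matrix (single head, no memory-saving tricks), which is the simplest way to see the quadratic growth:

```python
# Bytes needed to hold the full N x N attention score matrix in fp32,
# for a single head, with no memory-saving tricks.
def score_matrix_bytes(n_tokens, bytes_per_float=4):
    return n_tokens * n_tokens * bytes_per_float

for n in (1024, 4096, 16384):
    print(n, score_matrix_bytes(n) // 2**20, "MiB")  # 4, 64, 1024 MiB
```

Quadrupling the context length multiplies this one matrix's memory by sixteen.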
Where did you get that GPT-3 has 12,288-dimensional token embeddings? I think that's the internal or output size of the token inside the transformer layers, not in the embedding table.
It doesn't really use them; it uses something called RoPE, which is hardcoded rather than learned and is applied multiplicatively at every layer to both the query and the key.
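A minimal RoPE sketch, written from scratch for illustration (not any library's API): rotate consecutive dimension pairs of a vector by a position-dependent angle. Because the rotation is computed from the position rather than looked up in a learned table, there is no row count capping the context length:

```python
import math

# Minimal RoPE sketch: rotate each consecutive pair of dimensions
# (vec[i], vec[i+1]) by an angle that depends on the token's position.
def rope(vec, pos, base=10000.0):
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)      # lower frequency for later dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation
    return out

# Applied to queries and keys before the attention dot product; the
# rotation makes that dot product depend on the relative offset of the
# two positions.
q = rope([1.0, 0.0, 1.0, 0.0], pos=3)
k = rope([0.0, 1.0, 0.0, 1.0], pos=5)
```

Since rotations preserve vector norms, the magnitudes of queries and keys are untouched; only their relative angles encode position.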
There are already solutions to this kind of problem: use embeddings to store semantic meaning -> query the vector database with a question -> use extractive Q&A models to get relevant context -> use a Reader model to generate answers based on the context from the document.
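A toy sketch of that retrieve-then-read shape. The "embedding" here is just a bag-of-words count vector standing in for a real sentence-embedding model, the index is a plain list standing in for a vector database, and the reader step is left out; it only illustrates the pipeline's structure:

```python
import math
import re
from collections import Counter

# Stand-in "embedding": bag-of-words token counts (a real system would
# use a learned sentence-embedding model here).
def embed(text):
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a, b):
    num = sum(a[k] * b.get(k, 0) for k in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

docs = [
    "GPT-2 uses a learned position embedding table with 1024 rows.",
    "Haystack provides pipelines for extractive question answering.",
]
index = [(d, embed(d)) for d in docs]          # 1. embed and index the corpus

def retrieve(question, k=1):
    q = embed(question)                        # 2. embed the question the same way
    return sorted(index, key=lambda pair: -cosine(q, pair[1]))[:k]

# 3. the top-ranked chunk would then be handed to a reader model
best = retrieve("how many rows are in the GPT-2 position table?")[0][0]
```

Only the retrieved chunk needs to fit in the model's context window, which is the whole point of the approach.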
Just check out the Haystack tutorials. I started looking into it after being introduced to the concept by articles mentioning OpenAI embeddings and the GPT-3 API, but it can be done using open-source models.
I used Haystack due to the readily available colab notebook[1] for their tutorials. I wanted to feed my own text corpus to it, and that was the fastest way available.
The Langchain docs are helpful, and it would be even better if you published an end-to-end notebook using a popular dataset. Definitely looking forward to trying langchain as I dive deeper into this.
>Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.
Yes. You can break the document up, index each part, and then tackle it that way. It works surprisingly well. The 4096-token limit is tied to the attention window, not an API restriction.
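The break-up-and-index step can be sketched like this; the window and overlap sizes are made up for illustration, the idea being that each chunk fits under the model's context limit while overlapping enough that no answer is split across a boundary:

```python
# Split a long token sequence into overlapping windows that each fit
# under a context limit (sizes here are illustrative, not prescriptive).
def chunk(tokens, size=100, overlap=20):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = [f"tok{i}" for i in range(250)]
parts = chunk(doc)
print([len(p) for p in parts])  # [100, 100, 90]
```

Each chunk is then embedded and indexed separately, and questions are answered against whichever chunk retrieval ranks highest.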
That’s a great question. Memory in transformers typically scales as O(N^2) with token count, so there must be an upper limit, but I would bet it is far more than 4096 tokens.
OpenAI limits prompts to 4096 tokens.
If there was no limit, could the LLM be fed a 100 page document in the prompt and then answer questions about it?