
Is there a limit for the number of tokens that can be fed to the models when run locally?

OpenAI limits prompts to 4096 tokens.

If there was no limit, could the LLM be fed a 100 page document in the prompt and then answer questions about it?



The limitation comes from the size of the word position embedding matrix. It isn't a config issue or an API restriction: it's the size of a matrix that is part of the model, fixed before training. You can't change it afterwards.

What does that mean?

For each token in the input or the generated output, the model needs some representation of what that token's position means.

So there is the word position embedding matrix, which contains one vector per position. The matrix has "only" 1024 entries for GPT2 or 4096 for GPT3. The size of each entry varies too: a 768-dimensional vector for GPT2 small, up to 12,288 for GPT3.

So the WPE (word position embeddings) for GPT2 is (1024x768) and for GPT3 (4096x12288)

Inference requires info from this vector to be added to the word tokens embedding for each token in the original prompt + each generated token.
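A minimal sketch of that lookup-and-add step, using random matrices in place of trained weights (the sizes match the GPT2 small figures above; `embed` is a hypothetical helper, not a real API):

```python
import numpy as np

CONTEXT_LEN = 1024   # rows in the WPE matrix: the max positions the model knows
EMBED_DIM = 768      # width of each embedding vector (GPT2 small)
VOCAB_SIZE = 50257   # GPT2 vocabulary size

rng = np.random.default_rng(0)
wte = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))    # word token embeddings
wpe = rng.normal(size=(CONTEXT_LEN, EMBED_DIM))   # word position embeddings

def embed(token_ids):
    """Look up each token's embedding and add the embedding for its position."""
    n = len(token_ids)
    if n > CONTEXT_LEN:
        # There is simply no row in wpe for position 1025 and beyond.
        raise ValueError(f"{n} tokens exceeds the {CONTEXT_LEN}-position WPE matrix")
    return wte[token_ids] + wpe[:n]

x = embed([15496, 11, 995])  # three arbitrary token ids
print(x.shape)               # (3, 768)
```

The hard limit falls out directly: a prompt longer than the number of rows in `wpe` has no position vector to add.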


Positional embeddings are tricky - it very much depends on the specific embedding method chosen. Some advanced methods let performance be preserved, or even slightly improved, when the context length is increased beyond what was used for the main pretraining run.

As often is the case with these large models, you can change it with some finetuning on longer context samples from the same dataset, with what is really a small amount of compute invested compared to the million hours spent on training the thing.


You get this issue even without position embeddings. Attention computes an inner product between each pair of input tokens, so N^2 x E. Squares grow really fast.
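The N^2 growth is easy to see in a toy score computation (this is just the raw pairwise-score matrix, ignoring the separate query/key projections a real transformer uses):

```python
import numpy as np

def attention_scores(x):
    """One scaled inner product per pair of tokens: (N, E) in, (N, N) out."""
    return x @ x.T / np.sqrt(x.shape[1])

x = np.random.default_rng(0).normal(size=(1024, 768))  # N=1024 tokens, E=768
s = attention_scores(x)
print(s.shape)      # (1024, 1024): N^2 scores
print(s.size * 8)   # 8,388,608 bytes of float64 for one layer's score matrix
```

Doubling N quadruples the score matrix, independent of how positions are encoded.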


Where did you get that GPT3 has 12288 size token embeddings? I think that's the internal or output size of the token inside the transformer layers, not in the embedding table.


Thanks for explaining, very enlightening.


Do you know what the WPEs are for llama?


It doesn't really use them, it uses something called RoPE, which is hardcoded rather than learned and is applied multiplicatively at every layer to both the queries and the keys.

https://arxiv.org/abs/2104.09864
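The core idea can be sketched in a few lines: rotate each pair of dimensions of a query or key vector by an angle proportional to the token's position, with nothing learned. This uses the half-split pair layout; the paper also describes an interleaved variant:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding: rotate each (x1, x2) dimension pair of every
    token by a position-dependent angle. Applied to queries and keys at every
    layer; the frequencies are fixed, not trained."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation rate per pair
    angles = np.outer(np.arange(n), freqs)      # (n, half): position * rate
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, dim 8
q_rot = rope(q)
print(np.allclose(q_rot[0], q[0]))  # True: position 0 means zero rotation
```

Because the rotation depends only on relative offsets once you take inner products, there is no fixed embedding table to run out of, although performance still degrades beyond the lengths seen in training.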


There are already solutions to this kind of problem. Use embeddings to capture semantic meaning -> store them in a vector database and query it with the question -> use extractive q/a models to get relevant context -> use a Reader model to generate answers based on the context from the document.

Just check out the Haystack tutorials. I started looking into it after getting introduced to the concept by articles mentioning OpenAI embeddings and the GPT3 API, but it can be done using open source models.
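The retrieval half of that pipeline fits in a few lines with toy stand-ins (the hash-based `embed` here is purely illustrative; a real setup would use a sentence-embedding model and a vector database like Haystack provides):

```python
import numpy as np

def embed(text, dim=64):
    """Hypothetical embedding: hash words into a fixed-size unit vector."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "GPT-2 uses a learned 1024-entry position embedding matrix.",
    "RoPE rotates query and key vectors instead of adding learned positions.",
    "Haystack pipelines combine a retriever with a reader model.",
]
index = np.stack([embed(d) for d in docs])   # the "vector database"

def retrieve(question, k=1):
    """Rank documents by cosine similarity (vectors are unit-norm)."""
    scores = index @ embed(question)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("How does RoPE handle positions?")
# A reader model / LLM would now answer using only this retrieved context,
# so the full 100-page document never has to fit in one prompt.
```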


Would like to bring up LangChain as well: https://langchain.readthedocs.io/en/latest/. We recently integrated Milvus (https://milvus.io) into LangChain, so you'll be able to store and process billions of documents.


I used Haystack due to the readily available colab notebook[1] for their tutorials. I wanted to feed my own text corpus to it, and that was the fastest way available.

Langchain docs are helpful, and it would be even better if you published an end-to-end notebook using a popular dataset. Definitely looking forward to trying langchain as I dive deeper into this.

1. https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...


You can check out our library too, which does just that :)

https://github.com/jerpint/buster


Is there documentation for feeding your own documentation into it?


We will be adding that soon


I use this method to answer questions about historic and real-time social media comments.

https://foretale.io/toolbox/Social_Media_QA


Context limit is a property of the model and set at training time. Computational complexity is quadratic with context length.

You can potentially summarize recursively, or build an index, to break larger texts down into smaller chunks.
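A sketch of the recursive-summarization idea, with a placeholder `summarize` standing in for an actual LLM call (both helper names are hypothetical, and token counts are approximated as chars/4):

```python
def summarize(text, limit):
    """Placeholder: a real system would prompt the model for a summary here."""
    return text[:limit]

def chunk(text, max_tokens=4096, chars_per_token=4):
    """Split text into pieces that each fit one context window."""
    size = max_tokens * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), size)]

def reduce_document(text, max_tokens=4096, chars_per_token=4):
    """Recursively shrink a long document until it fits one context window."""
    while len(text) > max_tokens * chars_per_token:
        parts = [summarize(c, limit=len(c) // 2)
                 for c in chunk(text, max_tokens, chars_per_token)]
        text = "\n".join(parts)
    return text  # now short enough to prepend to a question

short = reduce_document("x" * 100_000)
print(len(short))  # fits within the ~16k-char (~4096-token) budget
```

Each pass halves the chunks and stitches them back together, so even a 100-page document converges to something that fits in one prompt, at the cost of lost detail per pass.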


Has anyone had luck with doing this sort of thing when asking ChatGPT/GPT3.5 to generate code? Like, ask it to generate functions at a later time?


Yep.

>Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.

https://arxiv.org/abs/2212.10561


Yes. You can break the document up and index each part and then tackle it that way. It works surprisingly well. The 4096 token limit is tied to the attention window, not an API restriction.


Attention doesn’t have a window, unless you mean there’s a limit to the number of absolute positional embeddings available at train time?


LlamaIndex offers ways to chunk up your data and store them in data structures for response synthesis: https://gpt-index.readthedocs.io/en/latest/guides/primer.htm...


This is exactly what LlamaIndex is meant to solve!

A set of data structures to augment LLM's with your data: https://github.com/jerryjliu/gpt_index


That's a great question - typically memory in transformers scales as O(N^2) with token count, so there must be an upper limit, but I would bet it's far more than 4096 tokens.
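A back-of-envelope check of that O(N^2) claim, counting only the attention score matrices (96 heads and fp16 scores are assumptions roughly in line with GPT3-scale models, not figures from this thread):

```python
def attn_matrix_bytes(n_tokens, n_heads=96, bytes_per_score=2):
    """Bytes for one layer's attention score matrices: heads * N^2 scores."""
    return n_tokens ** 2 * n_heads * bytes_per_score

# A "100 page" document at ~500 tokens/page is about 50,000 tokens.
for n in (4096, 8192, 100 * 500):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:8.1f} GiB per layer")
```

Doubling the context quadruples this memory, which is why long documents are usually chunked rather than fed in whole.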


Longer inputs consume more memory, and if your inputs are shorter than the context length, they have to be padded.

So asking a simple one-sentence question of a model that has an 81,920 token limit would be a colossal waste of resources.


Padding + masking. So the transformer doesn't waste time on the padding.
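Masking is usually done by pushing the padding columns to -inf before the softmax, so they receive exactly zero attention weight:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_scores(scores, mask):
    """Set scores for padding slots to -inf so softmax ignores them."""
    return np.where(mask[None, :], scores, -np.inf)

# A 5-slot window holding a 3-token input plus 2 padding slots.
mask = np.array([True, True, True, False, False])
scores = np.zeros((5, 5))          # uniform scores, just to show the effect
weights = softmax(masked_scores(scores, mask))
print(weights[0])                  # [1/3, 1/3, 1/3, 0, 0]
```

The padding still occupies memory in the batch, but the model spends no attention on it.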


Yes, it's a fundamental limitation of their architecture



