
Is there a limit for the number of tokens that can be fed to the models when run locally?

OpenAI limits prompts to 4096 tokens.

If there was no limit, could the LLM be fed a 100 page document in the prompt and then answer questions about it?



The limitation comes from the size of the word position embedding matrix. It isn't a config issue or an API restriction: it's the size of a matrix that is part of the model, fixed before training. You can't change it afterwards.

What does that mean?

For each token in the input or the generated output, the model needs some representation of what that token's position means.

So there is the word position embedding matrix, which contains one vector per position. The matrix has "only" 1024 entries for GPT2 or 4096 for GPT3. The size of each entry varies too: a 768-dimensional vector for GPT2 small, up to 12,288 for GPT3.

So the WPE (word position embeddings) for GPT2 is (1024x768) and for GPT3 (4096x12288)

Inference requires info from this vector to be added to the word tokens embedding for each token in the original prompt + each generated token.
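A minimal sketch of that lookup-and-add step, using random matrices in place of trained weights (the sizes match the GPT2 small figures above; `embed` is a hypothetical helper, not a real API):

```python
import numpy as np

CONTEXT_LEN = 1024   # rows in the WPE matrix: the max positions the model knows
EMBED_DIM = 768      # width of each embedding vector (GPT2 small)
VOCAB_SIZE = 50257   # GPT2 vocabulary size

rng = np.random.default_rng(0)
wte = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))    # word token embeddings
wpe = rng.normal(size=(CONTEXT_LEN, EMBED_DIM))   # word position embeddings

def embed(token_ids):
    """Look up each token's embedding and add the embedding for its position."""
    n = len(token_ids)
    if n > CONTEXT_LEN:
        # There is simply no row in wpe for position 1025 and beyond.
        raise ValueError(f"{n} tokens exceeds the {CONTEXT_LEN}-position WPE matrix")
    return wte[token_ids] + wpe[:n]

x = embed([15496, 11, 995])  # three arbitrary token ids
print(x.shape)               # (3, 768)
```

The hard limit falls out directly: a prompt longer than the number of rows in `wpe` has no position vector to add.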


Positional embeddings are tricky - it very much depends on the specific embedding method chosen. Some advanced methods let performance be preserved, or even slightly improved, when the context length is increased beyond what was used for the main pretraining run.

As often is the case with these large models, you can change it with some finetuning on longer context samples from the same dataset, with what is really a small amount of compute invested compared to the million hours spent on training the thing.


You get this issue even without position embeddings. Attention computes an inner product between each pair of input tokens, so N^2 x E. Squares grow really fast.
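The N^2 growth is easy to see in a toy score computation (this is just the raw pairwise-score matrix, ignoring the separate query/key projections a real transformer uses):

```python
import numpy as np

def attention_scores(x):
    """One scaled inner product per pair of tokens: (N, E) in, (N, N) out."""
    return x @ x.T / np.sqrt(x.shape[1])

x = np.random.default_rng(0).normal(size=(1024, 768))  # N=1024 tokens, E=768
s = attention_scores(x)
print(s.shape)      # (1024, 1024): N^2 scores
print(s.size * 8)   # 8,388,608 bytes of float64 for one layer's score matrix
```

Doubling N quadruples the score matrix, independent of how positions are encoded.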


Where did you get that GPT3 has 12288 size token embeddings? I think that's the internal or output size of the token inside the transformer layers, not in the embedding table.


Thanks for explaining, very enlightening.


Do you know what the WPEs are for llama?


It doesn't really use them, it uses something called RoPE, which is hardcoded rather than learned and is applied multiplicatively at every layer to both the queries and the keys.

https://arxiv.org/abs/2104.09864
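The core idea can be sketched in a few lines: rotate each pair of dimensions of a query or key vector by an angle proportional to the token's position, with nothing learned. This uses the half-split pair layout; the paper also describes an interleaved variant:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding: rotate each (x1, x2) dimension pair of every
    token by a position-dependent angle. Applied to queries and keys at every
    layer; the frequencies are fixed, not trained."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation rate per pair
    angles = np.outer(np.arange(n), freqs)      # (n, half): position * rate
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, dim 8
q_rot = rope(q)
print(np.allclose(q_rot[0], q[0]))  # True: position 0 means zero rotation
```

Because the rotation depends only on relative offsets once you take inner products, there is no fixed embedding table to run out of, although performance still degrades beyond the lengths seen in training.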


There are already solutions to this kind of problem. Use embeddings to capture semantic meaning -> store them in a vector database and query it with the question -> use extractive q/a models to get relevant context -> use a Reader model to generate answers based on the context from the document.

Just check out the Haystack tutorials. I started looking into it after getting introduced to the concept by articles mentioning OpenAI embeddings and the GPT3 API, but it can be done using open source models.
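The retrieval half of that pipeline fits in a few lines with toy stand-ins (the hash-based `embed` here is purely illustrative; a real setup would use a sentence-embedding model and a vector database like Haystack provides):

```python
import numpy as np

def embed(text, dim=64):
    """Hypothetical embedding: hash words into a fixed-size unit vector."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "GPT-2 uses a learned 1024-entry position embedding matrix.",
    "RoPE rotates query and key vectors instead of adding learned positions.",
    "Haystack pipelines combine a retriever with a reader model.",
]
index = np.stack([embed(d) for d in docs])   # the "vector database"

def retrieve(question, k=1):
    """Rank documents by cosine similarity (vectors are unit-norm)."""
    scores = index @ embed(question)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("How does RoPE handle positions?")
# A reader model / LLM would now answer using only this retrieved context,
# so the full 100-page document never has to fit in one prompt.
```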


Would like to bring up LangChain as well: https://langchain.readthedocs.io/en/latest/. We recently integrated Milvus (https://milvus.io) into LangChain, so you'll be able to store and process billions of documents.


I used Haystack due to the readily available colab notebook[1] for their tutorials. I wanted to feed my own text corpus to it, and that was the fastest way available.

Langchain docs are helpful, and it would be even better if you published an end-to-end notebook using a popular dataset. Definitely looking forward to trying langchain as I dive deeper into this.

1. https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...


You can check out our library too, which does just that :)

https://github.com/jerpint/buster


Is there documentation for feeding your own documentation into it?


We will be adding that soon


I use this method to answer questions about historic and real-time social media comments.

https://foretale.io/toolbox/Social_Media_QA


Context limit is a property of the model and set at training time. Computational complexity is quadratic with context length.

You can potentially summarize recursively, or build an index, to break larger texts down into smaller chunks.
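A sketch of the recursive-summarization idea, with a placeholder `summarize` standing in for an actual LLM call (both helper names are hypothetical, and token counts are approximated as chars/4):

```python
def summarize(text, limit):
    """Placeholder: a real system would prompt the model for a summary here."""
    return text[:limit]

def chunk(text, max_tokens=4096, chars_per_token=4):
    """Split text into pieces that each fit one context window."""
    size = max_tokens * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), size)]

def reduce_document(text, max_tokens=4096, chars_per_token=4):
    """Recursively shrink a long document until it fits one context window."""
    while len(text) > max_tokens * chars_per_token:
        parts = [summarize(c, limit=len(c) // 2)
                 for c in chunk(text, max_tokens, chars_per_token)]
        text = "\n".join(parts)
    return text  # now short enough to prepend to a question

short = reduce_document("x" * 100_000)
print(len(short))  # fits within the ~16k-char (~4096-token) budget
```

Each pass halves the chunks and stitches them back together, so even a 100-page document converges to something that fits in one prompt, at the cost of lost detail per pass.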


Has anyone had luck with doing this sort of thing when asking ChatGPT/GPT3.5 to generate code? Like, ask it to generate functions at a later time?


Yep.

>Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.

https://arxiv.org/abs/2212.10561


Yes. You can break the document up and index each part and then tackle it that way. It works surprisingly well. The 4096 token limit is tied to the attention window, not an API restriction.


Attention doesn’t have a window, unless you mean there’s a limit to the number of absolute positional embeddings available at train time?


LlamaIndex offers ways to chunk up your data and store them in data structures for response synthesis: https://gpt-index.readthedocs.io/en/latest/guides/primer.htm...


This is exactly what LlamaIndex is meant to solve!

A set of data structures to augment LLM's with your data: https://github.com/jerryjliu/gpt_index


That's a great question - typically memory in transformers scales as O(N^2) with token count, so there must be an upper limit, but I would bet it's far more than 4096 tokens.
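A back-of-envelope check of that O(N^2) claim, counting only the attention score matrices (96 heads and fp16 scores are assumptions roughly in line with GPT3-scale models, not figures from this thread):

```python
def attn_matrix_bytes(n_tokens, n_heads=96, bytes_per_score=2):
    """Bytes for one layer's attention score matrices: heads * N^2 scores."""
    return n_tokens ** 2 * n_heads * bytes_per_score

# A "100 page" document at ~500 tokens/page is about 50,000 tokens.
for n in (4096, 8192, 100 * 500):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:8.1f} GiB per layer")
```

Doubling the context quadruples this memory, which is why long documents are usually chunked rather than fed in whole.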


Longer inputs consume more memory, and if your inputs are shorter than the context length, they have to be padded.

So asking a simple one-sentence question of a model that has an 81,920 token limit would be a colossal waste of resources.


Padding + masking. So the transformer doesn't waste time on the padding.
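Masking is usually done by pushing the padding columns to -inf before the softmax, so they receive exactly zero attention weight:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_scores(scores, mask):
    """Set scores for padding slots to -inf so softmax ignores them."""
    return np.where(mask[None, :], scores, -np.inf)

# A 5-slot window holding a 3-token input plus 2 padding slots.
mask = np.array([True, True, True, False, False])
scores = np.zeros((5, 5))          # uniform scores, just to show the effect
weights = softmax(masked_scores(scores, mask))
print(weights[0])                  # [1/3, 1/3, 1/3, 0, 0]
```

The padding still occupies memory in the batch, but the model spends no attention on it.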


Yes, it's a fundamental limitation of their architecture



