This sounds a lot like how we used to do research, by reading books and writing ...

visarga · on Sept 20, 2024

The fundamental problem with both keyword- and embedding-based retrieval is that they only access surface-level features. If your document contains 5+5 and you search "where is the result 10", you won't find the answer. That is why all texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.

"study your data before indexing it"
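To make the 5+5 example concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `digest` is a toy regex stand-in for the LLM pass (a real setup would prompt a model instead), and `build_index` and `search` are a hypothetical keyword index, not any particular library's API.

```python
import re

STOPWORDS = {"where", "is", "the", "a", "of"}

def digest(text: str) -> str:
    """Toy stand-in for an LLM pass: append the result of each 'a+b' sum."""
    notes = [
        f"{a}+{b} has the result {int(a) + int(b)}"
        for a, b in re.findall(r"(\d+)\s*\+\s*(\d+)", text)
    ]
    return " ".join([text] + notes)

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, minus a tiny stopword list."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def build_index(docs: dict[str, str], enrich: bool = False) -> dict[str, set[str]]:
    """Map doc id -> token set, optionally digesting each text first."""
    return {doc_id: tokenize(digest(t) if enrich else t) for doc_id, t in docs.items()}

def search(query: str, index: dict[str, set[str]]) -> list[str]:
    """Rank doc ids by keyword overlap; drop docs with no overlap."""
    q = tokenize(query)
    scored = [(len(q & toks), doc_id) for doc_id, toks in index.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

docs = {"a": "The ledger entry is 5+5.", "b": "Where is the nearest cafe?"}

raw_hits = search("where is the result 10", build_index(docs))
enriched_hits = search("where is the result 10", build_index(docs, enrich=True))
```

With the raw index the query matches nothing, because "10" never appears in the text. After the digest step, document "a" carries the explicit result and is found.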

skybrian · on Sept 20, 2024

Makes sense. After retrieval, it seems both would be useful: the exact quote and a summary of its context.
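A rough sketch of what that could look like, assuming each indexed chunk stores its verbatim text next to a summary written at indexing time. The `Chunk` type and `format_for_prompt` helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    quote: str            # exact text from the source document
    context_summary: str  # summary of the surrounding context, written at indexing time

def format_for_prompt(chunk: Chunk) -> str:
    """Return the context summary and the verbatim quote together."""
    return f'Context: {chunk.context_summary}\nQuote: "{chunk.quote}"'

chunk = Chunk(
    quote="The ledger entry is 5+5.",
    context_summary="A ledger line whose sum works out to 10.",
)
prompt_text = format_for_prompt(chunk)
```

Keeping the quote verbatim means the answer can still be checked against the source, while the summary supplies what the quote alone leaves implicit.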