I spent a good amount of time last year working on a system to analyse patent documents.
Patents are difficult because they can include anything from abstract diagrams and chemical formulas to mathematical equations, so it tends to be really tricky to prepare the data in a way that can later be used by an LLM.
The simplest approach I found was to “take a picture” of each page of the document and ask an LLM to generate a JSON explaining the content (plus some other metadata such as page number, number of visual elements, and so on).
If a complicated image is present, simply ask the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.
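Roughly, the loop looks like the sketch below. This is just one way to wire it up: pdf2image for rasterizing pages and an OpenAI-style vision endpoint are assumptions on my part, and the model name and prompt are placeholders to adapt to your own stack.

    # Sketch of the page-screenshot -> JSON approach described above.
    # Assumes pdf2image for rasterizing and an OpenAI-style vision API;
    # the model name and prompt are placeholders.
    import base64, io, json
    from pdf2image import convert_from_path
    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "Return a JSON object describing this patent page: the main text, "
        "a plain-language description of every diagram or formula, and a "
        "count of the visual elements on the page."
    )

    def page_to_json(image, page_number):
        # Encode the rendered page as a base64 PNG for the vision call.
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder; any vision-capable model works
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        record = json.loads(resp.choices[0].message.content)
        record["page_number"] = page_number  # metadata added alongside the model output
        return record

    pages = convert_from_path("patent.pdf", dpi=200)
    records = [page_to_json(img, i + 1) for i, img in enumerate(pages)]
    # Each record can now be embedded and pushed to your vector store of choice.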
I can’t speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.
You can ask the model to describe the image, but that is inherently lossy. What if it is a chart and the model captures most of the (x, y) pairs, but the user asks about a value it missed? Presenting the image at inference time is effective because you're guaranteeing the LLM can answer exactly the user's question. The only blocker then becomes how good retrieval is, and that's a smaller problem to solve. This approach lets us solve only for passing in relevant context; the rest is taken care of by the LLM. Otherwise the problem space expands to correct OCR, parsing, and getting every possible description of the images out of the model up front.
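Concretely, the only difference is what you hand the model at inference time. A rough sketch under some assumptions: retrieve() stands in for whatever vector search you already have (returning records that point back to the stored page images), and the model name is a placeholder for any vision-capable model.

    # Sketch of "retrieve the page, show the original image at inference".
    # retrieve() is a placeholder for your existing vector search; each hit
    # is assumed to carry a path back to the stored page image.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def answer(question, retrieve):
        hits = retrieve(question, top_k=3)  # e.g. [{"page_image_path": ...}, ...]
        content = [{"type": "text", "text": question}]
        for hit in hits:
            # Attach the original page image so nothing is lost to a text summary.
            with open(hit["page_image_path"], "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            content.append({"type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{b64}"}})
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder vision-capable model
            messages=[{"role": "user", "content": content}],
        )
        return resp.choices[0].message.content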
This is a great example of how to use LLMs, thanks.
But it also illustrates to me that the opportunities with LLMs right now are primarily about reclassifying or reprocessing existing sources of value, like patent documents. In the '90s and '00s, many successful software businesses were building databases to replace traditional filing.
Creating fundamentally new collections of value that require upfront investment still seems to be challenging for our economy.