I am surprised nobody has mentioned it yet. If this is for anything slightly com...

cpa · on July 30, 2024

I completely agree. Like the previous comment mentioned, I've explored this area over the past year, and in my tests, the offerings from Amazon, Google, and Microsoft were far superior to the open-source options, especially for long documents. It's unfortunate, but that's the way it is.

OCR itself isn't the issue; most open-source models handle that adequately. The problem lies in the lack of comprehensive features:

- Identification of chapters and headings

- Segmentation of headers and footers with an easy way to filter them out

- Handling of images

- Correctly processing two-column or other non-standard layouts

- Avoiding out-of-memory (OOM) errors, which, while not a flaw of the open-source software itself, is a common and frustrating issue

- Transcription of tables and forms, which exists in open-source models but isn't as effective

These ergonomic features are where the open-source solutions fall short.
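As a rough illustration of the header/footer point, here is a minimal sketch of one common heuristic: treat a line as a header or footer if it recurs at the top or bottom of most pages. It assumes you already have per-page lists of text lines from whatever OCR tool you use; `strip_repeated_lines`, `min_fraction`, and `edge_lines` are made-up names for this example, not part of any library.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6, edge_lines=2):
    """Drop lines that recur at the top or bottom of most pages.

    pages: list of pages, each a list of text lines (assumed OCR output).
    edge_lines: how many lines at each page edge to inspect (must be >= 1).
    """
    counts = Counter()
    for lines in pages:
        # Only the first and last few lines are header/footer candidates.
        # A set ensures each candidate is counted once per page.
        edges = {line.strip() for line in lines[:edge_lines] + lines[-edge_lines:]}
        counts.update(edge for edge in edges if edge)

    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and line.strip() in repeated:
                continue  # looks like a running header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Report", "Intro text", "More text", "Confidential"],
    ["ACME Report", "Second page body", "Confidential"],
    ["ACME Report", "Third page body", "ACME Report is mentioned here", "Confidential"],
]
for page in strip_repeated_lines(pages):
    print(page)
```

This only catches lines that repeat verbatim, so page numbers that change from page to page would slip through; the commercial services label these regions for you, which is the convenience being described above.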

constantinum · on July 30, 2024

Perfectly put!

Other challenges are:

1. Complex layout tables, tables that span multiple pages

2. Handwritten text - in loan processing and income tax documents

3. Checkboxes and radio buttons, which are important for automating workflows in insurance and loan processing

4. Scanned images

5. Photographed documents from the field.

6. Orientation - landscape mode vs. portrait mode

7. Text represented as a Bezier curve

8. Non-aligned texts in multicolumn text layout

9. Background images and watermarks

Other important considerations:

1. Privacy and security - cloud vs. on-premise

2. Performance and speed of extraction at scale

3. If you are ultimately feeding the output to LLMs, how does the extractor help reduce the token count?

Anyone curious about why parsing PDFs is hell for RAG can refer to this - https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

[edit] - formatting

vikp · on July 30, 2024

Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.

equilibrium · on July 30, 2024

Having explored this topic over the past month, this is the correct answer. It has also been mentioned in the comments by jumploops.

CharlieDigital · on July 30, 2024

Azure Document Intelligence with the Document Layout Model is pretty damn amazing at this.

Key thing is it labels titles, headers, sections, etc. This way you can stuff headers into the child chunks for much better RAG.
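To make the "stuff headers into the child chunks" idea concrete, here is a minimal sketch. It assumes the layout output has already been flattened into a list of labeled elements; the `Element` class, the `chunks_with_headers` function, and the role names `"title"`, `"sectionHeading"`, and `"paragraph"` are placeholders for this example, not the service's actual response schema.

```python
from dataclasses import dataclass


@dataclass
class Element:
    role: str  # placeholder labels: "title", "sectionHeading", "paragraph"
    text: str


def chunks_with_headers(elements, max_chars=500):
    """Return text chunks, each prefixed with the headings it sits under."""
    title, heading = "", ""
    buffer = []
    chunks = []

    def flush():
        # Emit the buffered paragraphs as one chunk with its heading path.
        if buffer:
            prefix = " > ".join(h for h in (title, heading) if h)
            body = " ".join(buffer)
            chunks.append(f"{prefix}\n{body}" if prefix else body)
            buffer.clear()

    for el in elements:
        if el.role == "title":
            flush()
            title, heading = el.text, ""
        elif el.role == "sectionHeading":
            flush()
            heading = el.text
        else:
            # Start a new chunk when the current one would grow too large.
            if buffer and sum(len(t) for t in buffer) + len(el.text) > max_chars:
                flush()
            buffer.append(el.text)
    flush()
    return chunks


elements = [
    Element("title", "Loan Policy"),
    Element("sectionHeading", "Eligibility"),
    Element("paragraph", "Applicants must be 18."),
    Element("paragraph", "Income proof is required."),
    Element("sectionHeading", "Fees"),
    Element("paragraph", "A 1% fee applies."),
]
for chunk in chunks_with_headers(elements, max_chars=30):
    print(chunk, end="\n---\n")
```

Each chunk carries its heading path, so a chunk retrieved on its own still says which section it came from.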