
I am surprised nobody has mentioned it yet.

If this is for anything even slightly commercial, you will probably have the best luck with Textract, Document Intelligence, or Document AI. Nothing else listed in the comments is as accurate, especially for extracting forms, tables, and text. A multimodal model will take care of the images. The combination of the two will get you a great representation of the PDF.

Open-source tools work and can be extremely powerful, but 1) you won't have images and 2) your workflows will break unless you are building for a specific PDF template.



I completely agree. Like the previous comment mentioned, I've explored this area over the past year, and in my tests, the offerings from Amazon, Google, and Microsoft were far superior to the open-source options, especially for long documents. It's unfortunate, but that's the way it is.

OCR itself isn't the issue; most open-source models handle that adequately. The problem lies in the lack of comprehensive features:

- Identification of chapters and headings

- Segmentation of headers and footers with an easy way to filter them out

- Handling of images

- Correctly processing two-column or other non-standard layouts

- Avoiding out-of-memory (OOM) errors, which, while not a flaw of the open-source software itself, is a common and frustrating issue

- Transcription of tables and forms, which exists in open-source models but isn't as effective

These ergonomic features are where the open-source solutions fall short.
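To make the header/footer point concrete, here is a minimal sketch of the kind of filter you end up hand-rolling on top of open-source extractors. It assumes you already have each page as a list of text lines; the function name, the `edge` and `min_fraction` parameters, and the "repeats on most pages" heuristic are all my own assumptions, not how any of the cloud services do it.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge=2, min_fraction=0.6):
    """Drop lines near the top/bottom of a page that repeat across pages.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    edge: how many lines at the top and bottom of a page to inspect.
    min_fraction: share of pages a line must appear on to count as a header/footer.
    """
    counts = Counter()
    for lines in pages:
        edge_lines = lines[:edge] + lines[-edge:]
        # Use a set so a line is counted at most once per page.
        counts.update({_normalize(l) for l in edge_lines if l.strip()})

    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        last = len(lines) - edge
        cleaned.append([
            line for i, line in enumerate(lines)
            if not ((i < edge or i >= last) and _normalize(line) in repeated)
        ])
    return cleaned
```

A heuristic like this works on clean, templated documents and falls over on scanned or multi-column pages, which is roughly the gap being described.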


Perfectly put!

Other challenges are:

1. Complex layout tables, tables that span multiple pages

2. Handwritten text - in loan processing and income tax documents

3. Checkboxes and radio buttons are so important in insurance and loan processing to automate workflows.

4. Scanned images

5. Photographed documents from the field.

6. Orientation - landscape mode vs. portrait mode

7. Text represented as a Bezier curve

8. Non-aligned texts in multicolumn text layout

9. Background images and watermarks

Other important considerations:

1. Privacy and security - cloud vs. on-premise

2. Performance and speed of extraction at scale

3. If you are ultimately feeding the output to LLMs, how does the extractor help reduce tokens?

If you're curious why parsing PDFs is hell for RAG, see https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

[edit] - formatting


Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.


Having explored this topic over the past month, I think this is the correct answer. It has also been mentioned in the comments by jumploops.


Azure Document Intelligence with the Document Layout Model is pretty damn amazing at this.

The key thing is that it labels titles, headers, sections, etc. That way you can stuff headers into the child chunks for much better RAG.
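A rough sketch of what "stuffing headers into the child chunks" can look like. It assumes the layout output has already been flattened into `(role, text)` pairs; the role strings used here (`title`, `sectionHeading`, `pageHeader`, `pageFooter`, `pageNumber`) and the `chunk_with_headings` helper are illustrative assumptions, so check them against whatever your extractor actually returns.

```python
# Roles treated as page furniture and dropped (assumed names, adjust as needed).
SKIP_ROLES = {"pageHeader", "pageFooter", "pageNumber"}


def chunk_with_headings(paragraphs):
    """Prefix each body paragraph with the title and section it falls under.

    paragraphs: iterable of (role, text) pairs in reading order, where role is
    None for ordinary body text (hypothetical input shape).
    """
    chunks = []
    title = None
    section = None
    for role, text in paragraphs:
        if role in SKIP_ROLES:
            continue
        if role == "title":
            # A new title resets the current section.
            title, section = text, None
        elif role == "sectionHeading":
            section = text
        else:
            prefix = " > ".join(h for h in (title, section) if h)
            chunks.append(f"{prefix}\n{text}" if prefix else text)
    return chunks
```

Each chunk then carries its own context, so a retrieved paragraph about fees still says which document and section it came from.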



