If this is for anything even slightly commercial, you will probably have the best luck with Textract, Document Intelligence, or Document AI. Nothing else listed in the comments is as accurate, especially for extracting forms, tables, and text. A multimodal model will take care of your images. The combination of the two will get you a great representation of the PDF.
Open-source tools work and can be extremely powerful, but 1) you won't have images and 2) your workflows will break unless you are building for a specific PDF template.
I completely agree. Like the previous comment mentioned, I've explored this area over the past year, and in my tests, the offerings from Amazon, Google, and Microsoft were far superior to the open-source options, especially for long documents. It's unfortunate, but that's the way it is.
OCR itself isn't the issue; most open-source models handle that adequately. The problem lies in the lack of comprehensive features:
- Identification of chapters and headings
- Segmentation of headers and footers with an easy way to filter them out
- Handling of images
- Correctly processing two-column or other non-standard layouts
- Avoiding out-of-memory (OOM) errors, which, while not a flaw of the open-source software itself, is a common and frustrating issue
- Transcription of tables and forms, which exists in open-source models but isn't as effective
These ergonomic features are where the open-source solutions fall short.
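To make the header/footer point concrete, here is a minimal sketch of the kind of filtering I mean, assuming you already have each page as a list of text lines from whatever OCR or extraction tool you use. The function names and the `edge` and `min_share` parameters are made up for illustration and do not come from any particular library. The idea is simply that a line which shows up near the top or bottom of most pages, ignoring digits such as page numbers, is probably a running header or footer.

```python
import re
from collections import Counter


def normalize(line):
    # Collapse digit runs so "Page 3" and "Page 12" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_margins(pages, edge=2, min_share=0.6):
    """Drop lines that repeat at the top/bottom of most pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    edge: how many lines at each end of a page count as the margin.
    min_share: fraction of pages a margin line must appear on to be dropped.
    """
    counts = Counter()
    for lines in pages:
        margin = lines[:edge] + lines[-edge:]
        # Use a set so a line is counted once per page.
        counts.update({normalize(l) for l in margin if l.strip()})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and normalize(line) in repeated:
                continue  # likely a running header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["Acme Report", "Intro text", "more", "Page 1"],
    ["Acme Report", "Body text", "more body", "Page 2"],
    ["Acme Report", "End text", "last", "Page 3"],
]
cleaned = strip_repeated_margins(pages)
```

This only covers the easy case. It relies on the extractor returning lines in reading order, and it will miss headers that change per chapter, which is part of why a built-in, layout-aware version of this feature matters.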
Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.