I was just processing a document with Tesseract and OCRmyPDF, and two things came up:
My first time processing it, I used `ocrmypdf --redo-ocr` because it looked like there was some existing OCR. The result was poor: ocrmypdf didn't recognize the existing layer as OCR output and instead treated it as real text that should be kept. This was fixable with `ocrmypdf --force-ocr`.
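For reference, the difference between the two runs, roughly (file names here are placeholders, not from the original document):

```shell
# --redo-ocr tries to preserve "real" digital text and only re-OCR image
# regions; it can misjudge an old, low-quality OCR layer as genuine text.
ocrmypdf --redo-ocr scan.pdf out.pdf

# --force-ocr rasterizes every page and OCRs from scratch, discarding any
# existing text layer entirely.
ocrmypdf --force-ocr scan.pdf out.pdf
```

`--force-ocr` is lossy for pages that really do contain digital text, so it only makes sense when you know the existing layer is bad, as it was here.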
Before realizing this, I discovered that Tesseract 4 & 5 use neural-network-based recognition. I then came across this step-by-step guide on fine-tuning Tesseract for a specific document set: https://www.statworx.com/en/content-hub/blog/fine-tuning-tes...
I didn't end up following the fine-tuning process because at this point `ocrmypdf --force-ocr` worked excellently, but I thought the draw_box_file_data.py script from their example was particularly useful: https://gist.github.com/flaviut/d901be509425098645e4ae527a9e...
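That script overlays the contents of a Tesseract box file on the page image so you can eyeball the training labels. A minimal sketch of just the parsing step it depends on, assuming the standard Tesseract box-file format (`symbol left bottom right top page`, with y measured from the image's bottom edge):

```python
def parse_box_file(text, image_height):
    """Parse Tesseract box-file lines into (symbol, (l, t, r, b)) tuples
    using top-left-origin coordinates, as most image libraries expect."""
    boxes = []
    for line in text.splitlines():
        # Split from the right: the symbol field itself may contain spaces.
        parts = line.rsplit(maxsplit=5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        sym, left, bottom, right, top, _page = parts
        left, bottom, right, top = map(int, (left, bottom, right, top))
        # Tesseract measures from the bottom-left corner; flip the y axis.
        boxes.append((sym, (left, image_height - top,
                            right, image_height - bottom)))
    return boxes
```

With boxes in top-left coordinates, drawing them is a loop over `ImageDraw.rectangle` calls (or equivalent) on the page image.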