Rescribe: A high quality OCR tool for historic books (rescribe.xyz)
84 points by dbuxton on Nov 26, 2021 | 10 comments


I have the Kindle version of The Seleucid Royal Economy which for obvious reasons includes Greek text.

It's been OCRed, and the Greek has been mangled beyond belief. Sometimes the OCR will even insert a word break in the middle of a single character.

No real point to the story, but it feels relevant here. I see Rescribe has already encountered the problem: "In the second step we run the OCR on the preprocessed files, using our specifically trained packages and adapting language and character settings to the document at hand."

(I'm only complaining to a very small degree. Having a low-quality OCRed ebook available is much better than having no ebook available. And what is normally displayed is the image of the text, not the OCRed nonsense, so it doesn't matter that the Greek has been transformed into gibberish until you encounter the odd mid-character word break.)
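
Tools like Rescribe typically drive Tesseract under the hood, and the fix it describes, telling the engine which scripts to expect, is easy to demonstrate. A minimal sketch, assuming pytesseract with the eng and grc (Ancient Greek) traineddata installed, and a hypothetical page.png:

    # Sketch only: mixed English/Ancient Greek OCR with pytesseract.
    # Assumes tesseract plus the eng and grc traineddata are installed;
    # page.png is a hypothetical scan.
    from PIL import Image
    import pytesseract

    img = Image.open("page.png")

    # English-only settings force Greek glyphs into Latin lookalikes.
    print(pytesseract.image_to_string(img, lang="eng"))

    # Declaring both scripts lets the engine choose the right alphabet.
    print(pytesseract.image_to_string(img, lang="grc+eng"))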


I think the folks at OpenLibrary.org would benefit from something like this.


PGDP is the project that's doing high-quality book transcriptions. OpenLibrary is a distribution mechanism.


OpenLibrary also does automated OCR-type stuff, though generally it's subsidiary to the scans of the page.

But search, and some recent tools for extracting data from books (e.g. finding URLs in books and saving them in the Wayback Machine so people can see what the book linked to), all rely on automated OCR, so any improvements will help.
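
To illustrate that kind of pipeline (a sketch, not OpenLibrary's actual code): pull URLs out of OCR output with a regex and ask the Wayback Machine to archive them via its public save endpoint. book_ocr.txt is a hypothetical file and the regex is deliberately naive:

    # Sketch: archive every URL found in OCR output.
    import re
    import requests

    with open("book_ocr.txt") as f:  # hypothetical OCR output
        ocr_text = f.read()

    # Naive pattern; real OCR output needs fuzzier matching (l/1, O/0 swaps etc.).
    urls = re.findall(r"https?://[^\s)\"']+", ocr_text)

    for url in urls:
        # GET https://web.archive.org/save/<url> asks the Wayback Machine
        # to take a snapshot of that page.
        resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
        print(resp.status_code, url)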


Is Tesseract any good yet? Last I heard they were experimenting with deep-learning-based recognition, but before that I'd tried it and it didn't work at all. Kind of Pocketsphinx levels of rubbish.


In my experience, you need to pre-process the scans before Tesseract gives reasonable results.

I'd wager 90%+ accuracy after pre-processing a book with ScanTailor.

Accuracy falls off a cliff if the text is handwritten rather than printed, but it's still useful to get a side-by-side editable OCR <-> image view.
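
ScanTailor is interactive, but the basic clean-up it does can be approximated in code. A sketch of the simplest version, grayscale plus Otsu binarisation before OCR, assuming opencv-python and pytesseract with a hypothetical page.png; which steps actually help depends on the scan:

    # Sketch: ScanTailor-style clean-up before handing the page to Tesseract.
    import cv2
    import pytesseract

    img = cv2.imread("page.png")  # hypothetical scan
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu binarisation strips paper texture, stains, and bleed-through.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    print(pytesseract.image_to_string(binary))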


90% is pretty bad though, isn't it? That means every 10th letter is wrong! Unless you're using some other measure of accuracy.
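
For what it's worth, OCR accuracy is usually quoted as character error rate (CER) or word error rate (WER): edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A self-contained sketch of CER (the example strings are made up):

    # Sketch: character error rate = edit distance / reference length.
    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def cer(reference, hypothesis):
        return levenshtein(reference, hypothesis) / len(reference)

    # 90% *character* accuracy garbles every tenth letter; 90% *word*
    # accuracy is a much stronger claim.
    print(cer("the quick brown fox", "the qu1ck brwn fox"))  # ~0.105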


It’s been good for a long time, with an important caveat: Tesseract historically needed preprocessing if your scans were skewed or warped. Years back I talked with someone from Google who said that they’d focused their efforts on the preprocessing side, deskewing and identifying structure like columns, rather than extensively modifying Tesseract itself.
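
Deskewing itself is cheap to do outside Tesseract. A common sketch estimates the page angle from the minimum-area rectangle around the ink pixels; note that OpenCV's minAreaRect angle convention has changed between versions, so the sign handling below is an assumption to check against your version:

    # Sketch: estimate and correct small skew before OCR (opencv-python, numpy).
    import cv2
    import numpy as np

    img = cv2.imread("page.png")  # hypothetical scan
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Fit a rotated rectangle around all ink pixels to estimate the skew.
    coords = np.column_stack(np.where(ink > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # recent OpenCV reports angles in (0, 90]
        angle -= 90

    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite("deskewed.png", deskewed)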


"Is it good yet" is trolling. If you reference a benchmark, gold-standard etc. and list specific use case failures you might be taken more seriously.

It has a massive impact in various scientific fields that require only "good enough", which, all things considered, is magic in many cases. I don't know about its use outside this area, but if it has had that level of impact inside science, I suspect it has had a huge impact in the broader world.


I think it's a valid criticism of Tesseract. It's very dependent on the input, and if they have managed to improve this, it would hugely increase its usefulness.



