I have the Kindle version of The Seleucid Royal Economy, which for obvious reasons includes Greek text.
It's been OCRed, and the Greek has been mangled beyond belief. Sometimes the OCR will split a single character.
No real point to the story, but it feels relevant here. I see Rescribe has already encountered the problem: "In the second step we run the OCR on the preprocessed files, using our specifically trained packages and adapting language and character settings to the document at hand."
(I'm only complaining to a very small degree. Having a low-quality OCRed ebook available is much better than having no ebook available. And what is normally displayed is the image of the text, not the OCRed nonsense, so it doesn't matter that the Greek has been transformed into gibberish until you encounter the odd mid-character word break.)
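For what it's worth, the "adapting language and character settings" step Rescribe describes mostly comes down to telling the engine which scripts to expect. A rough sketch of the idea using pytesseract and the polytonic Greek (grc) model, assuming that traineddata package is installed; the file name is made up:

    # Sketch only: point Tesseract at the right scripts so the Greek isn't
    # forced through a Latin-only model. Assumes grc traineddata is installed.
    from PIL import Image
    import pytesseract

    page = Image.open("seleucid_p123.png")  # hypothetical scanned page
    text = pytesseract.image_to_string(page, lang="grc+eng")  # mixed Greek/English page
    print(text)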
OpenLibrary also does automated OCR-type stuff, though generally it's subsidiary to the scans of the page.
But search and some recent tools for extracting data from books (e.g. ones that find URLs in books and then save them in the Wayback Machine, so people can see what the book linked to) all rely on automated OCR, so any improvements will help.
Is Tesseract any good yet? Last I heard they were experimenting with deep-learning-based recognition, but before that I tried it and it didn't work at all. Kind of Pocketsphinx levels of rubbish.
In my experience, you need to pre-process the scanned images before Tesseract gives reasonable results.
I'd wager 90%+ accuracy after pre-processing a book with ScanTailor.
Accuracy falls off a cliff if the text is handwritten rather than printed, but it's still useful to get a side-by-side, editable OCR <-> image view.
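To make the "pre-process first" workflow concrete, here's a minimal sketch using OpenCV's Otsu threshold as a crude stand-in for what ScanTailor does interactively (deskewing, cropping, binarizing); file names are made up:

    # Sketch: clean up the scan, then OCR it. Tesseract does much better on
    # crisp black-on-white input than on grey, noisy scans.
    import cv2
    import pytesseract

    img = cv2.imread("page_017.png", cv2.IMREAD_GRAYSCALE)

    # Otsu binarization as a rough substitute for ScanTailor's output
    _, binarized = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    text = pytesseract.image_to_string(binarized, lang="eng")
    print(text[:500])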
It's been good for a long time, with an important caveat: Tesseract historically needed preprocessing if your scans were skewed or warped. Years back I talked with someone from Google who said that they'd focused their efforts on the preprocessor side (deskewing and identifying structure like columns) rather than on extensively modifying Tesseract itself.
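Tesseract does expose some of that layout handling through its public knobs nowadays. A hedged sketch (not what Google runs internally, just the options pytesseract exposes, and assuming the osd traineddata is installed):

    # Sketch: ask Tesseract about page orientation, then OCR with automatic
    # layout analysis so multi-column pages aren't read straight across.
    from PIL import Image
    import pytesseract

    page = Image.open("scan_042.png")  # hypothetical skewed two-column scan

    osd = pytesseract.image_to_osd(page)  # reports e.g. "Rotate: 90"
    print(osd)

    text = pytesseract.image_to_string(page, config="--psm 1")  # PSM 1 = auto layout + OSD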
"Is it good yet" is trolling. If you reference a benchmark, gold-standard etc. and list specific use case failures you might be taken more seriously.
It has had a massive impact in various scientific fields that only need "good enough" results, which, all things considered, is magic in many cases. I don't know about its use outside this area, but I suspect that if it has had this level of impact inside science, it has had a huge impact in the broader world.
I think it's a valid criticism of Tesseract. It's very dependent on the input, and if they have managed to improve this, it would hugely increase its usefulness.