Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

What is the smallest average percentage of errors that you can achieve with today's OCRs (after book scanning) processes and software?

I have heard that an error rate of 5-10% in OCR is still a good average. Any experiences or facts regarding this?

I am writing "average" as of course the quality of the source makes a big difference.

I am assuming a book scanning process by using a robotic book scanner.

ChrisZZ

Comments

Answer by gmcgath

Talking about the error rate requires narrowing down the terms and variables. As a character error rate, 5 to 10% sounds horribly high, so I'll assume we're talking about word error rate. The typeface and contrast have a significant effect; for the best case, let's say we're talking about common, readable fonts such as Times and Helvetica in black print on white paper. One study says that word errors with bitonal images can get word error rates below 1% (though the software which achieves that rate is hopeless at color text).

A study of Google Books reports a page error rate of less than 1%.

Comments

Answer by cneud

Certain fonts can in principle be recognized with next to 100% accuracy, though even this depends heavily on the source and capture.

In real life, the Bavarian State Library did a project with 20th-century monographs (good print quality, simple layout, common characteristics, modern spelling) and the outcome was > 98% percent word accuracy, which enables numerous retrieval scenarios as can be seen in the search interface (German only). This was obtained with images in greyscale made by a scanrobot and with commercial OCR software - results can be much worse with free OCR software, but this depends highly on factors such as language support and level of post-processing etc. The more tailoring and post-processing involved, the better the result.

Most importantly, as already pointed out, for use in retrieval, (non-stopword) word accuracy is what you should mainly be interested in, and which is naturally lower than character accuracy - and not even all OCR software provides this metric.

It may also be noteworthy that OCR software generally just reports "suspicious" characters or words, which in fact can be right or wrong, thus only giving an estimate of the actual number of errors. In order to measure that, comparison against ground-truth (the 100% correct text) is the only way - the impact competence centre has plenty of information on this, as about OCR in general.

Comments

Answer by Gene Golovchinsky

If the goal of OCR is to support information retrieval, choice of indexing algorithms can mitigate OCR error rates somewhat. Indexing character n-grams (runs of n characters, where n is typically 2-4) rather than whole words will reduce the impact of single-character mis-recognition (and speeling errors on queries, too :-)). Thus you metric should not necessarily be word error rates, but precision.

Comments