The Zombie Stack Exchanges That Just Won't Die


Converting our printed genealogies into digital form

Our genealogy society has seven 900-page printed volumes, published in the 1870s and 1980s-1990s. The font is typewriter-like, which does not work well with OCR. We own the publishing rights and have experimented with taking individual volumes apart to feed the pages through a sheet-fed scanner. The results are good enough for an image PDF but not for text. As you can imagine, an OCR error that changes a marriage date from 1868 to 1888 is significant and unacceptable.

As our books go out of print, it's a simple matter to strip one and create a book on CD. However, moving forward into the 21st century, we would like to convert the information to digital form (a standard genealogy database). After three years of trying to figure something out, the only solution we have found is to type the thousands of pages from scratch into our genealogy database.

One thing that makes OCR particularly difficult is that the tails of letters tend to be faint. For example, the bottom of the "g" in "Strong" is OCR'd as a "q". Since this is the Strong family genealogy, that's a problem! I can search and replace, but even then the error rate remains too high. With a typewriter font, my OCR program also "cleans" out nearly all punctuation, so no periods or semicolons remain.
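The search and replace mentioned above can be narrowed so that it only touches words that become a known name once "q" is read as "g". The following is a minimal Python sketch, not a tested workflow: the name list `KNOWN_NAMES` and the function `fix_faint_tails` are hypothetical, and it assumes the OCR output is available as plain text.

```python
import re

# Hypothetical list of names expected in the book; extend as needed.
KNOWN_NAMES = {"Strong"}

def fix_faint_tails(text, known_names=KNOWN_NAMES):
    """Swap q for g only when the result is a known name."""
    def repair(match):
        word = match.group(0)
        candidate = word.replace("q", "g")
        # Leave the word alone unless the repaired form is a name we expect.
        return candidate if candidate in known_names else word
    # Visit each run of letters and repair it independently.
    return re.sub(r"[A-Za-z]+", repair, text)

print(fix_faint_tails("John Stronq married Mary Quigley"))
```

Because ordinary words such as "quiet" or "Quigley" do not turn into a listed name, they are left unchanged. This does nothing for misread dates or lost punctuation, which still need proofreading.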

We'd be grateful for any suggestions in converting our print books to database form.

Edward Barnard


Answer by Nathan

Although this doesn't solve your OCR problem, you might have a look at LeafSeek. It's a free and open source program for creating searchable online databases specifically about genealogy. It's fairly robust and sports an attractive interface.

Regarding OCR, what resolution are you scanning at? I've had fairly good results with typewriter documents by scanning at 300 PPI, sometimes 600 PPI. The software you're using also affects the output. ABBYY FineReader usually does a good job, but I tend to use Adobe Acrobat's ClearScan OCR (I think it's in versions 9+). It also improves the image quality.


Answer by LAG

It would be helpful to know some of your "specs": what resolution you're scanning at, which OCR software you're using, and whether you're running high- or low-resolution images through that software.

One thing we try to keep in mind with our digitized genealogical content is that having the content accessible online, even with an OCR error rate, is still better than it sitting on a shelf in a single location. We've also had fantastic success, both in uptake and in accuracy of transcriptions, by asking volunteers to help transcribe our content through Flickr (http://www.flickr.com/photos/statelibrarync/sets/72157627124710723/). Genealogists are hugely motivated! While a crowdsourcing project may be down the road for you, keep that in mind as an option.
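Whether the text comes from volunteers or from OCR, one way to catch errors such as 1868 becoming 1888 is to compare two independent versions of the same page and have a person check only the lines that disagree. This is a rough Python sketch of that idea, not something we run ourselves; the function name `disagreements` is made up for illustration, and it assumes both versions keep the same line breaks.

```python
from itertools import zip_longest

def disagreements(first, second):
    """Return (line_number, first_line, second_line) for lines that differ."""
    pairs = zip_longest(first.splitlines(), second.splitlines(), fillvalue="")
    return [
        (number, a, b)
        for number, (a, b) in enumerate(pairs, start=1)
        if a != b
    ]

version_a = "m. 1868\nb. 1840"
version_b = "m. 1888\nb. 1840"
for number, a, b in disagreements(version_a, version_b):
    print(f"line {number}: {a!r} vs {b!r}")
```

Lines on which both versions agree can still be wrong, so this reduces the proofreading load rather than replacing it.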

This sounds like a fantastic opportunity, and I hope you can move forward with digitization and also preservation of this content.

Lisa Gregory, State Library of North Carolina