Zombse

The Zombie Stack Exchanges That Just Won't Die

View the Project on GitHub anjackson/zombse

What is the indexing/matching/ranking algorithm of the LOC newspaper archive?

When searching the "Chronicling America" pages, on the LOC web site, I am wondering if my lack or results is due to OCR errors. How is that archive indexed? Does it index OCR's words, or character n-grams? The advanced search page has an option to search within a word window, which is often useful, but if the words themselves are poorly recognized, word-based searching will not be as effective as n-grams might be.

Does anyone know the search engine is organized?

Gene Golovchinsky

Comments

Answer by Ed Summers

I work as a software developer on the Chronicling America project. We index the OCR with Solr, which is a web service wrapper around the Lucene search library. Lucene's scoring algorithm uses Vector Space Model and tf-idf for ranking. What this basically means is summed up in the Lucene document:

In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

This is why you often see pages with lots of pictures, and little text on them, at the top of search results, since the ratio of the matched words to the words on the page is higher than pages with more words on them. Sometimes we joke that the search is actually a historic comic strip search engine :-)

The OCR quality does vary from newspaper to newspaper, depending on the quality of the microfilm that was digitized, and the quality of the newspaper that was microfilmed. It can be very noisy indeed. We actually expose the OCR for pages, which you can get by clicking on the "Text" hyperlink when viewing a newspaper page. Here's an example There are links to a plain text and OCR XML version of the OCR at the bottom.

We are actually in the process of bundling up the newspaper OCR for bulk download so that researchers, partners and other interested folks can index the content easily. If you have any ideas for improving search, or questions about failing searches we would love to hear them. There are knobs to turn in Solr for boosting fields, and changing the way indexing is performed, so there is definitely room for improvement.

The Chronicling America codebase is currently available on SourceForge, and you can email the mailing list if you want to get in touch. I hope this helps!

Comments