Improving Machine-Readable Text for Newspapers in Chronicling America – Library of Congress (.gov)


Posted by: Amber Paranick
The following is a guest post by Nathan Yarasavage, a Digital Projects Specialist in the Serial and Government Publications Division.
Back in 2005, the National Endowment for the Humanities (NEH) and the Library of Congress launched the National Digital Newspaper Program (NDNP), the program that supports Chronicling America and its keyword-searchable access to what is now over 23 million newspaper pages. The Library is excited to announce a new effort to reprocess some of the newspapers that we digitized in the early years of the program. The goal of this effort is to improve the machine-readable text that powers the keyword search of this rich content.
Machine-readable text is created by a technology called Optical Character Recognition (OCR). Using the Tesseract Open Source OCR Engine and customized post-processing tools, the Library created a new OCR reprocessing workflow specifically for Chronicling America.
A lot has changed in 20 years, and OCR technology is no exception. By taking advantage of these improvements, the Library will provide a higher-quality search experience. Better OCR yields more accurate search results for users and a cleaner full-text index for our servers.
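To illustrate why cleaner OCR text leads to better keyword search, here is a minimal sketch in Python. It is not the Library's workflow or search system. The sample sentences, the page id, and the helper names `build_index` and `search` are all hypothetical, made up for this illustration. The sketch builds a tiny full-text index from noisy and from clean text for the same page and shows that a keyword only matches when the OCR got the word right.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Keep runs of letters only; digits and punctuation split words apart.
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, keyword):
    """Return the sorted page ids whose text contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical OCR output for the same page, before and after reprocessing.
noisy_pages = {"page-1": "Tbe rnayor op3ned tlie new libr4ry on Mam Street"}
clean_pages = {"page-1": "The mayor opened the new library on Main Street"}

noisy_index = build_index(noisy_pages)
clean_index = build_index(clean_pages)

# The misread word "libr4ry" never enters the index as "library",
# so the page cannot be found until the text is corrected.
print(search(noisy_index, "library"))
print(search(clean_index, "library"))
```

In this toy example the noisy index also fills up with fragments such as "libr" and "ry" that no one would search for, which is one way to picture what a cleaner full-text index means.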
Pictured below is an example of the improvements gained in just one paragraph of a newspaper that has gone through this new workflow.  Before reprocessing, the machine-readable text was mostly unrecognizable. As a result of our new workflow, newspaper articles like this will now appear in search results.
 
 
Since its implementation, the new workflow has significantly improved access to newspapers added during the early years of the program. However, given the inherent challenges of working with historical newspapers on microfilm, we always remind users that achieving 100% accuracy in an automated OCR process is virtually impossible. Many factors affect the quality of machine-readable text created from historical materials on microfilm, and many of them are out of our control. The original newsprint may have been damaged or may have deteriorated before it was microfilmed. Poor-quality microfilming practices can also produce subpar images that interfere with the OCR process. On top of these factors, historical newspapers often have tight columns and tiny text that further complicate OCR.
The Library team working on this effort is just getting started. To date, we have reprocessed over 170,000 newspaper pages. You can track the progress at the Improved Machine-Readable Text for Newspapers page on the Chronicling America research guide. Additionally, the improved text is available now for searching in the new interface of Chronicling America. Read more about the migration to the new interface.
 
As technologies improve and evolve, we plan to adapt our workflows even further to take advantage of the new opportunities.  In the future, we will provide more details about the technologies and processes we used.  For questions about the new workflow, please contact [email protected].
The Chronicling America historic newspapers online collection is a product of the National Digital Newspaper Program and is jointly sponsored by the Library and the National Endowment for the Humanities. For more guidance on how to use Chronicling America, take a look at Chronicling America: A Guide for Researchers. This guide provides an overview of the collection as well as recommended search topics, search strategies, website features, and frequently asked questions.
Try out a search in the new Chronicling America interface today!
Thank you! Chronicling America has helped me find family for decades. Now it’s going to be better.
Thanks for the update and to the Chronicling team for the fresh effort and improved searchability!




