Utilizing web data in identification and correction of OCR errors

Kazem Taghva; Shivam Agarwal

doi:10.1117/12.2042403

24 March 2014


Proceedings Volume 9021, Document Recognition and Retrieval XXI; 902109 (2014) https://doi.org/10.1117/12.2042403
Event: IS&T/SPIE Electronic Imaging, 2014, San Francisco, California, United States

Abstract

In this paper, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this paper further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
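The abstract does not spell out how the LCS measure and the Bayesian estimate are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each candidate correction comes with an occurrence count (standing in for evidence gathered from web search results), treats the normalized count as a prior, and uses the LCS length divided by the longer string's length as a stand-in likelihood. The names `lcs_length` and `rank_candidates`, the scoring formula, and the example counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rank_candidates(
    ocr_token: str, candidate_counts: Dict[str, int]
) -> List[Tuple[str, float]]:
    """Rank candidates by (assumed) prior * LCS-based similarity, best first.

    candidate_counts maps each candidate word to a hypothetical occurrence
    count; how such counts would be obtained is outside this sketch.
    """
    total = sum(candidate_counts.values())
    if total == 0:
        return []
    scored = []
    for cand, count in candidate_counts.items():
        prior = count / total  # assumption: relative frequency as the prior
        longest = max(len(ocr_token), len(cand))
        # assumption: normalized LCS length as a stand-in likelihood
        similarity = lcs_length(ocr_token, cand) / longest if longest else 0.0
        scored.append((cand, prior * similarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the paper.
    for word, score in rank_candidates("tbe", {"the": 90, "tube": 10}):
        print(word, round(score, 3))
```

In this toy setting the frequent candidate "the" outranks "tube" for the misrecognized token "tbe", even though "tube" shares a longer common subsequence with it. That shows why a frequency-based prior can matter alongside string similarity.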


Kazem Taghva and Shivam Agarwal "Utilizing web data in identification and correction of OCR errors", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 902109 (24 March 2014); https://doi.org/10.1117/12.2042403
