24 March 2014 Utilizing web data in identification and correction of OCR errors
Proceedings Volume 9021, Document Recognition and Retrieval XXI; 902109 (2014) https://doi.org/10.1117/12.2042403
Event: IS&T/SPIE Electronic Imaging, 2014, San Francisco, California, United States
Abstract
We report on experiments in detecting and correcting OCR errors using web data. Specifically, we use Google search to draw on the large body of text available on the web and identify candidate corrections. We then combine the Longest Common Subsequence (LCS) with Bayesian estimates to select the proper candidate automatically. On a small set of historical newspaper data, our experiments show a recall of 51% and a precision of 100%. We also provide a detailed classification and analysis of all errors and, in particular, point out where our approach fails to suggest proper candidates for the remaining errors.
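The candidate-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring, so everything here is an assumption: the function names `lcs_length` and `pick_candidate` are hypothetical, the candidate counts stand in for whatever evidence a web search would return, and the score (normalized LCS similarity multiplied by a relative-frequency prior) is only one plausible way to combine LCS with a Bayesian-style estimate, not the authors' formula.

```python
from typing import Dict, Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token: str, candidate_counts: Dict[str, int]) -> Optional[str]:
    """Return the candidate with the highest similarity * prior score.

    candidate_counts is a hypothetical mapping from candidate word to how
    often it was observed (for example, in search results). This scoring is
    an illustrative assumption, not the method from the paper.
    """
    total = sum(candidate_counts.values())
    best, best_score = None, 0.0
    for cand, count in candidate_counts.items():
        if not cand:
            continue  # skip empty candidates to avoid dividing by zero
        similarity = lcs_length(ocr_token, cand) / max(len(ocr_token), len(cand))
        prior = count / total  # relative frequency used as a simple prior
        score = similarity * prior
        if score > best_score:
            best, best_score = cand, score
    return best


# Made-up counts for an OCR token with one misrecognized character.
suggestion = pick_candidate("goverument", {"government": 90, "governor": 10})
```

In this sketch a candidate wins only if it both shares a long subsequence with the OCR token and is well supported by the (hypothetical) counts. A real system would also need a threshold below which no correction is made.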
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Kazem Taghva and Shivam Agarwal "Utilizing web data in identification and correction of OCR errors", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 902109 (24 March 2014); https://doi.org/10.1117/12.2042403
KEYWORDS
Optical character recognition
Error analysis
Data corrections
Liquid crystals
Lanthanum
Machine learning
Computer science

