Optical character recognition errors and their effects on natural language processing

Lopresti, Daniel

doi:10.1007/s10032-009-0094-8

Original Paper
Published: 25 September 2009

Volume 12, pages 141–151, (2009)
International Journal on Document Analysis and Recognition (IJDAR)

Daniel Lopresti¹

Abstract

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
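
As a rough illustration of treating error classification as an optimization problem, the sketch below aligns a ground-truth string with its OCR output using a standard single-level edit-distance dynamic program, then labels each disagreement as a substitution, deletion, or insertion. This is a simplified stand-in, not the hierarchical algorithm the paper describes; the function name `align_errors`, the unit edit costs, and the sample strings are assumptions made for the example.

```python
def align_errors(truth: str, ocr: str):
    """Align truth with ocr and return (edit_distance, error_list).

    Illustrative only: a flat, character-level alignment with unit
    costs (an assumption), not the paper's hierarchical method.
    """
    n, m = len(truth), len(ocr)
    # dist[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # character lost by OCR
                dist[i][j - 1] + 1,         # spurious character from OCR
            )

    # Trace back through the table to recover one optimal alignment.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    errors.append(("substitution", truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            errors.append(("insertion", "", ocr[j - 1]))
            j -= 1
    errors.reverse()
    return dist[n][m], errors


# A period misread as a comma: one character error that could also
# mislead a downstream sentence boundary detector.
distance, errors = align_errors("Mr. Smith", "Mr, Smith")
print(distance, errors)
```

A single misrecognized character such as this one is the kind of error whose downstream effect on sentence boundary detection, tokenization, and part-of-speech tagging the paper sets out to measure.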

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA, 18015, USA
Daniel Lopresti

Corresponding author

Correspondence to Daniel Lopresti.

About this article

Cite this article

Lopresti, D. Optical character recognition errors and their effects on natural language processing. IJDAR 12, 141–151 (2009). https://doi.org/10.1007/s10032-009-0094-8

Received: 19 December 2008
Revised: 26 August 2009
Accepted: 26 August 2009
Published: 25 September 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s10032-009-0094-8
