Optical character recognition errors and their effects on natural language processing

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
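To make the idea of treating error classification as an optimization problem more concrete, the following is a minimal sketch, not the hierarchical method described in the paper. It assumes a single-level, character-by-character Levenshtein alignment between a ground-truth string and its OCR output, solved by dynamic programming, with each difference labeled as a substitution, deletion, or insertion. The function name `align_errors` and the unit edit costs are illustrative assumptions.

```python
from typing import List, Tuple

Error = Tuple[str, str, str]  # (kind, truth_char, ocr_char)


def align_errors(truth: str, ocr: str) -> Tuple[int, List[Error]]:
    """Align truth against OCR output and label each difference.

    Hypothetical helper: plain Levenshtein alignment with unit costs,
    not the paper's hierarchical formulation.
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back through the table to recover one optimal alignment.
    errors: List[Error] = []
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and truth[i - 1] != ocr[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                errors.append(("substitution", truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Character present in the truth but missing from the OCR output.
            errors.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            # Spurious character present only in the OCR output.
            errors.append(("insertion", "", ocr[j - 1]))
            j -= 1
    errors.reverse()
    return cost[n][m], errors


distance, errors = align_errors("cat", "cot")
print(distance, errors)
```

A labeled alignment of this kind is one way to tie a downstream failure, such as a missed sentence boundary, back to the specific character error that caused it. When several alignments share the same minimum cost, this sketch returns only one of them.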

Author information

Corresponding author

Correspondence to Daniel Lopresti.

About this article

Cite this article

Lopresti, D. Optical character recognition errors and their effects on natural language processing. IJDAR 12, 141–151 (2009). https://doi.org/10.1007/s10032-009-0094-8

  • DOI: https://doi.org/10.1007/s10032-009-0094-8
