Results of Applying Probabilistic IR to OCR Text

  • Conference paper
SIGIR ’94

Abstract

Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In a broader sense, another fundamental measure of an OCR device’s quality is whether the text it generates is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to those obtained from the manually corrected equivalents. We show that there is no statistical difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found that individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.
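
As a rough illustration of the comparison described in the abstract, the sketch below computes set-based precision and recall for a single query against two result lists, one from a corrected collection and one from an OCR collection. The document ids, relevance judgments, and the `precision_recall` helper are all hypothetical and are not taken from the paper; the paper's own evaluation procedure and significance testing are not reproduced here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgments and result lists for one query.
relevant_docs = {"d1", "d2", "d3", "d4"}
corrected_run = ["d1", "d2", "d3", "d7"]  # results over manually corrected text
ocr_run = ["d1", "d2", "d8", "d9"]        # results over raw OCR text

p_corr, r_corr = precision_recall(corrected_run, relevant_docs)
p_ocr, r_ocr = precision_recall(ocr_run, relevant_docs)

print(f"corrected: precision={p_corr:.2f} recall={r_corr:.2f}")
print(f"ocr:       precision={p_ocr:.2f} recall={r_ocr:.2f}")
```

A single query like this one can differ sharply between the two collections even when averages over many queries do not, which is the kind of per-query effect the abstract mentions.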

Copyright information

© 1994 Springer-Verlag London Limited

About this paper

Cite this paper

Taghva, K., Borsack, J., Condit, A. (1994). Results of Applying Probabilistic IR to OCR Text. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_21

  • DOI: https://doi.org/10.1007/978-1-4471-2099-5_21

  • Publisher Name: Springer, London

  • Print ISBN: 978-3-540-19889-5

  • Online ISBN: 978-1-4471-2099-5

  • eBook Packages: Springer Book Archive
