Results of Applying Probabilistic IR to OCR Text

Taghva, Kazem; Borsack, Julie; Condit, Allen

doi:10.1007/978-1-4471-2099-5_21

Abstract

Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In the broader sense, another fundamental measure of an OCR’s goodness is whether its generated text is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to their manually corrected equivalent. We show there is no statistical difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.
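As a rough illustration of the comparison the abstract describes, the Python sketch below computes set-based precision and recall for a single query against two collections: one indexed from raw OCR text and one from the manually corrected text. The function name, document IDs, and relevance judgments are all hypothetical, and the sketch does not reproduce the evaluation procedure or the IR system used in the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical relevance judgments and result lists (not data from the paper).
relevant = {"d1", "d2", "d3", "d4"}
corrected_run = ["d1", "d2", "d3", "d7"]  # results over corrected text
ocr_run = ["d1", "d2", "d8", "d9"]        # results over raw OCR text

p_corr, r_corr = precision_recall(corrected_run, relevant)
p_ocr, r_ocr = precision_recall(ocr_run, relevant)

print(f"corrected: precision={p_corr:.2f} recall={r_corr:.2f}")
print(f"ocr:       precision={p_ocr:.2f} recall={r_ocr:.2f}")
```

A single query like this one can differ sharply between the two collections even when averages over many queries do not, which is the kind of per-query instability the abstract mentions.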

Author information

Authors and Affiliations

Information Science Research Institute, University of Nevada, Las Vegas, Las Vegas, NV, 89154, USA
Kazem Taghva, Julie Borsack & Allen Condit

Editor information

Editors and Affiliations

Department of Computer Science, University of Massachusetts, Amherst, MA, 01003, USA
Bruce W. Croft
Department of Computer Science, University of Glasgow, 8–17 Lilybank Gardens, Glasgow, G12 8RZ, Scotland
C. J. van Rijsbergen

About this paper

Cite this paper

Taghva, K., Borsack, J., Condit, A. (1994). Results of Applying Probabilistic IR to OCR Text. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_21

DOI: https://doi.org/10.1007/978-1-4471-2099-5_21
Publisher Name: Springer, London
Print ISBN: 978-3-540-19889-5
Online ISBN: 978-1-4471-2099-5
eBook Packages: Springer Book Archive