Abstract
Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In a broader sense, another fundamental measure of an OCR device's quality is whether the text it generates is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to those from their manually corrected equivalents. We show there is no statistically significant difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found that individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.
Copyright information
© 1994 Springer-Verlag London Limited
About this paper
Cite this paper
Taghva, K., Borsack, J., Condit, A. (1994). Results of Applying Probabilistic IR to OCR Text. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_21
DOI: https://doi.org/10.1007/978-1-4471-2099-5_21
Publisher Name: Springer, London
Print ISBN: 978-3-540-19889-5
Online ISBN: 978-1-4471-2099-5
eBook Packages: Springer Book Archive