Abstract
Optical character reader (OCR) misrecognition is a serious problem when searching OCR-scanned documents in databases such as digital libraries. This paper proposes fuzzy retrieval methods for English text containing OCR errors that do not require manual correction of the recognized text, thereby reducing costs. The proposed methods generate multiple search terms for each input query term using probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. Experiments on test-set retrieval indicate that one of the proposed methods, with 20 expanded search terms, improves the recall rate from 95.56% to 97.88% at the cost of a decrease in precision rate from 100.00% to 95.52%.
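To make the idea of query-term expansion concrete, the following is a minimal sketch and not the authors' automaton. It assumes a small, made-up confusion table (`CONFUSION`) standing in for error-occurrence probabilities and a toy `connection_prob` function standing in for character-connection probabilities. It uses a simple beam search, which only approximates the most probable variants. All names and numbers here are hypothetical.

```python
import heapq
import math

# Hypothetical error-occurrence table: P(OCR output | true character).
# Characters not listed are assumed to be recognized correctly.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "o": {"o": 0.95, "0": 0.05},
    "m": {"m": 0.92, "rn": 0.08},
}


def connection_prob(prefix, nxt):
    """Toy character-connection probability (an assumption for illustration).

    Penalizes a switch between digit and non-digit characters, since such
    transitions are rare inside English words.
    """
    if prefix and prefix[-1].isdigit() != nxt[0].isdigit():
        return 0.5
    return 1.0


def expand_term(term, k=20):
    """Return up to k OCR-style variants of `term`, most probable first."""
    beams = [(0.0, "")]  # (log probability, string generated so far)
    for ch in term:
        candidates = []
        for logp, prefix in beams:
            for out, p_err in CONFUSION.get(ch, {ch: 1.0}).items():
                p = p_err * connection_prob(prefix, out)
                candidates.append((logp + math.log(p), prefix + out))
        # Keep only the k best partial strings (beam pruning).
        beams = heapq.nlargest(k, candidates)
    return [(s, math.exp(lp)) for lp, s in beams]


if __name__ == "__main__":
    for variant, prob in expand_term("model", k=3):
        print(f"{variant}\t{prob:.4f}")
```

A search system could then issue every returned variant as a search term and merge the hits. Raising `k` admits less probable variants, which tends to raise recall and lower precision, the trade-off the abstract reports for 20 expanded terms.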
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ohta, M., Takasu, A., Adachi, J. (2000). Probabilistic Automaton Model for Fuzzy English-Text Retrieval. In: Borbinha, J., Baker, T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2000. Lecture Notes in Computer Science, vol 1923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45268-0_4
DOI: https://doi.org/10.1007/3-540-45268-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41023-2
Online ISBN: 978-3-540-45268-3
eBook Packages: Springer Book Archive