Probabilistic Automaton Model for Fuzzy English-Text Retrieval

Ohta, Manabu; Takasu, Atsuhiro; Adachi, Jun

doi:10.1007/3-540-45268-0_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1923)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Optical character reader (OCR) misrecognition is a serious problem when searching OCR-scanned documents in databases such as digital libraries. This paper proposes fuzzy retrieval methods for English text that tolerate errors in the recognized text without requiring manual correction, thereby reducing costs. The proposed methods generate multiple search terms for each input query term based on probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. Experimental results on test-set retrieval indicate that one of the proposed methods improves the recall rate from 95.56% to 97.88%, at the cost of a decrease in precision rate from 100.00% to 95.52%, with 20 expanded search terms.
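
The abstract does not spell out how the automata are built, so the following is only a minimal sketch of the general idea of expanding one query term into several likely OCR variants. It is not the authors' model. The confusion table `CONFUSIONS`, the bigram table `BIGRAMS`, the fallback value `DEFAULT_BIGRAM`, the function name `expand_term`, and every probability value are assumptions made up for illustration. Each candidate is scored by multiplying an error-occurrence probability (how likely a true character is to be recognized as a given string) with a character-connection probability (how likely the recognized character is to follow the previous one), and the highest-scoring candidates are kept.

```python
import math

# Hypothetical error-occurrence probabilities P(recognized | true character).
# Characters not listed are assumed to be recognized correctly.
CONFUSIONS = {
    "m": {"m": 0.92, "rn": 0.08},
    "o": {"o": 0.93, "0": 0.07},
    "l": {"l": 0.90, "1": 0.06, "i": 0.04},
}

# Hypothetical character-connection probabilities P(next | previous).
# "^" marks the start of a term; unseen pairs fall back to DEFAULT_BIGRAM.
BIGRAMS = {("m", "o"): 0.2, ("o", "d"): 0.1, ("e", "l"): 0.2}
DEFAULT_BIGRAM = 0.05


def expand_term(term, k=20, beam=100):
    """Return up to k (variant, score) pairs for term, best first."""
    hyps = {"": 0.0}  # recognized prefix -> best log-score so far
    for ch in term:
        nxt = {}
        for out, logp in hyps.items():
            prev = out[-1] if out else "^"
            for rec, p_err in CONFUSIONS.get(ch, {ch: 1.0}).items():
                # Connection is scored on the first recognized character only.
                p_conn = BIGRAMS.get((prev, rec[0]), DEFAULT_BIGRAM)
                score = logp + math.log(p_err) + math.log(p_conn)
                cand = out + rec
                if score > nxt.get(cand, -math.inf):
                    nxt[cand] = score
        # Beam pruning keeps the number of partial candidates bounded.
        best = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)
        hyps = dict(best[:beam])
    ranked = sorted(hyps.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(s, math.exp(lp)) for s, lp in ranked]


if __name__ == "__main__":
    for variant, score in expand_term("model", k=5):
        print(f"{variant}\t{score:.3e}")
```

In a retrieval setting, the returned variants would be searched together as alternatives for the original term. Raising `k` widens the set of variants, which is the kind of recall-versus-precision trade-off the abstract reports for 20 expanded search terms.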

Author information

Authors and Affiliations

National Institute of Informatics (NII), Hitotsubashi 2-1-2, 101-8430, Chiyoda-ku, Tokyo, Japan
Manabu Ohta, Atsuhiro Takasu & Jun Adachi

Editor information

Editors and Affiliations

National Library of Portugal, Campo Grande, 83, 1749-081, Lisboa, Portugal
José Borbinha
GMD Library, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Thomas Baker

About this paper

Cite this paper

Ohta, M., Takasu, A., Adachi, J. (2000). Probabilistic Automaton Model for Fuzzy English-Text Retrieval. In: Borbinha, J., Baker, T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2000. Lecture Notes in Computer Science, vol 1923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45268-0_4

DOI: https://doi.org/10.1007/3-540-45268-0_4
Published: 17 November 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41023-2
Online ISBN: 978-3-540-45268-3
eBook Packages: Springer Book Archive
