Reduction of expanded search terms for fuzzy English-text retrieval

  • Natural language processing for digital libraries
International Journal on Digital Libraries

Abstract.

Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text, which reduces costs. The proposed methods generate multiple search terms for each input query term by referring to confusion matrices, which store the characters each character is likely to be misrecognized as, together with the probability of each misrecognition. The proposed methods can improve recall without decreasing precision. However, in English-text fuzzy retrieval a single query term occasionally expands to a few million search terms, which slows retrieval intolerably. This paper therefore presents two remedies that reduce the number of generated search terms while maintaining retrieval effectiveness. The first restricts the number of errors allowed in each expanded search term; the second introduces a validity value different from our conventional one. Experimental results indicate that the first remedy reduced the number of terms to about 50 and the second to no more than 20.
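The expansion step and the first remedy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the confusion-matrix entries, the function name `expand_term`, the restriction to single-character substitutions, and the score (a plain product of per-character probabilities) are all assumptions made for illustration, and the abstract does not define either validity value.

```python
from itertools import combinations, product

# Hypothetical confusion matrix: for each correct character, the characters
# an OCR engine might output for it, with made-up probabilities.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "o": {"o": 0.95, "0": 0.05},
    "m": {"m": 0.97, "n": 0.03},
}

def expand_term(term, confusion, max_errors=1):
    """Return {variant: score} for variants with at most max_errors substitutions.

    The score is an assumed stand-in for a validity value: the product of the
    probabilities of the character outputs that make up the variant.
    """
    results = {}
    for k in range(max_errors + 1):
        # Choose which k positions of the term are misrecognized.
        for positions in combinations(range(len(term)), k):
            # Wrong outputs available at each chosen position.
            choices = [
                [(c, p) for c, p in confusion.get(term[i], {}).items() if c != term[i]]
                for i in positions
            ]
            for combo in product(*choices):
                chars = list(term)
                score = 1.0
                # Unchanged positions contribute the probability of a correct read
                # (1.0 for characters missing from the matrix).
                for i, ch in enumerate(term):
                    if i not in positions:
                        score *= confusion.get(ch, {}).get(ch, 1.0)
                # Changed positions contribute the misrecognition probability.
                for i, (c, p) in zip(positions, combo):
                    chars[i] = c
                    score *= p
                results["".join(chars)] = score
    return results

if __name__ == "__main__":
    for variant, score in sorted(expand_term("old", CONFUSION, 1).items()):
        print(variant, round(score, 4))
```

Without the cap, the number of variants is the product of the alternatives at every position, which is why long terms can expand into millions of search terms. Capping `max_errors` keeps only variants close to the query term; a threshold on the score would be one way to prune further.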

Additional information

Received: 18 December 1998 / Revised: 31 May 1999

About this article

Cite this article

Ohta, M., Takasu, A. & Adachi, J. Reduction of expanded search terms for fuzzy English-text retrieval. Int J Digit Libr 3, 140–151 (2000). https://doi.org/10.1007/s007999900014

  • DOI: https://doi.org/10.1007/s007999900014
