Probabilistic retrieval of OCR degraded text using N-grams

Harding, S. M.; Croft, W. B.; Weir, C.

doi:10.1007/BFb0026737

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1324)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents from n-gram databases built from 2- and 3-grams, or from 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data when the degradation level was 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, also performed better than standard queries. This method was less effective than direct n-gram query formulations, but it can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, the paper describes a web-based retrieval application that uses n-gram retrieval of OCR text and displays the source document image with query terms highlighted.
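
To make the query-expansion idea concrete, the following is a minimal Python sketch of how character n-grams might be used to find near-matching terms for a query word in OCR-degraded text. It is an illustration only, not the authors' method: the abstract does not state which term-similarity measure or weighting scheme was used, so the Dice coefficient, the 0.5 threshold, the toy vocabulary, and the function names `char_ngrams`, `dice` and `expand_term` are all assumptions introduced here.

```python
from typing import Iterable, List, Set, Tuple


def char_ngrams(term: str, sizes: Iterable[int] = (2, 3)) -> Set[str]:
    """Return the set of character n-grams of `term` for each size in `sizes`."""
    term = term.lower()
    grams: Set[str] = set()
    for n in sizes:
        # A term shorter than n contributes no n-grams of that size.
        for i in range(len(term) - n + 1):
            grams.add(term[i:i + n])
    return grams


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient between two n-gram sets (assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def expand_term(
    query_term: str,
    vocabulary: Iterable[str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
    sizes: Iterable[int] = (2, 3),
) -> List[Tuple[str, float]]:
    """Return vocabulary terms whose n-gram similarity to `query_term`
    meets `threshold`, sorted from most to least similar."""
    sizes = tuple(sizes)
    query_grams = char_ngrams(query_term, sizes)
    scored = [(t, dice(query_grams, char_ngrams(t, sizes))) for t in vocabulary]
    matches = [(t, s) for t, s in scored if s >= threshold]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Toy index vocabulary containing two OCR-style corruptions of "retrieval".
    vocab = ["retrieval", "retrieva1", "rctrieval", "library"]
    for term, score in expand_term("retrieval", vocab):
        print(f"{term}\t{score:.3f}")
```

In this sketch the corrupted forms `retrieva1` and `rctrieval` share most of their 2- and 3-grams with the query term and so are returned as expansion candidates, while `library` shares none and is dropped. A direct n-gram query formulation, as contrasted in the abstract, would instead index and match the n-grams themselves rather than expanding to whole terms.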

Author information

Authors and Affiliations

CIIR, University of Massachusetts, 01003, Amherst, MA, USA
S. M. Harding & W. B. Croft
Lockheed Martin C2 Systems, 19355, Frazer, PA, USA
C. Weir

Editor information

Carol Peters, Costantino Thanos

About this paper

Cite this paper

Harding, S.M., Croft, W.B., Weir, C. (1997). Probabilistic retrieval of OCR degraded text using N-grams. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026737

DOI: https://doi.org/10.1007/BFb0026737
Published: 17 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63554-3
Online ISBN: 978-3-540-69597-4
eBook Packages: Springer Book Archive
