Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

Magdy, Walid; Darwish, Kareem

doi:10.1007/11880561_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4209)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Arabic documents that are available only in print remain ubiquitous; they can be scanned and then OCR’ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR’ed documents using different index terms. The OCR correction uses an improved character-segment-based noisy channel model and is tested on both real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing with short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
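
The abstract's remark about index-term length can be illustrated with a minimal sketch. It is not the authors' system: the function names `char_ngrams` and `ngram_overlap`, the Latin-script stand-in word, and the single-character OCR error are all assumptions chosen for illustration. The sketch only shows why a short character n-gram index may tolerate an OCR error that breaks a whole-word match.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words shorter than n are returned whole (an illustrative choice).
    """
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(clean: str, degraded: str, n: int) -> float:
    """Fraction of the clean word's distinct n-grams still present in the degraded word."""
    clean_grams = set(char_ngrams(clean, n))
    degraded_grams = set(char_ngrams(degraded, n))
    return len(clean_grams & degraded_grams) / len(clean_grams)


# Hypothetical example: the OCR engine misreads the final "l" as "1".
clean = "retrieval"
degraded = "retrieva1"

# Trigram index terms: only the last trigram is damaged.
trigram_match = ngram_overlap(clean, degraded, 3)

# Whole-word index term (n equal to the word length): the single error
# destroys the only index term, so nothing matches.
word_match = ngram_overlap(clean, degraded, len(clean))

print(f"trigram overlap: {trigram_match:.2f}, whole-word overlap: {word_match:.2f}")
```

In this toy case six of the seven trigrams survive the error while the whole-word term does not match at all, which is consistent with the abstract's observation that the benefit of correction depends on the length of the index term.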

Author information

Authors and Affiliations

IBM Technology Development Center, P.O. Box 166, El-Ahram, Giza, Egypt
Walid Magdy & Kareem Darwish

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

Cite this paper

Magdy, W., Darwish, K. (2006). Word-Based Correction for Retrieval of Arabic OCR Degraded Documents. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_17

DOI: https://doi.org/10.1007/11880561_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)
