Abstract
Arabic documents that are available only in print remain ubiquitous; they can be scanned and then processed with OCR to make them easier to retrieve. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character-segment-based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing with short n-grams may be superior to word-based error correction. The results are potentially applicable to other languages.
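As a rough illustration of why index term length matters under OCR degradation, the following minimal sketch compares how many index terms survive a single character error when indexing by short character n-grams versus whole words. It is not the paper's method: the function names `char_ngrams` and `overlap`, the English example word, and the simulated error are all assumptions chosen for readability.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole, as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def overlap(clean: str, noisy: str, n: int) -> float:
    """Fraction of the clean word's distinct n-grams that also occur in the noisy word."""
    clean_grams = set(char_ngrams(clean, n))
    noisy_grams = set(char_ngrams(noisy, n))
    return len(clean_grams & noisy_grams) / len(clean_grams)


clean = "retrieval"
noisy = "retrieva1"  # hypothetical OCR error: final 'l' misread as '1'

# With trigrams, only the one n-gram covering the error is lost.
trigram_overlap = overlap(clean, noisy, 3)

# With the whole word as the index term, the single error destroys the match.
word_overlap = overlap(clean, noisy, len(clean))

print(f"trigram overlap: {trigram_overlap:.2f}")
print(f"whole-word overlap: {word_overlap:.2f}")
```

In this toy case a single misrecognized character leaves most trigrams intact while the whole-word term no longer matches, which is consistent with the abstract's observation that short n-grams can be competitive with word-based correction.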
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Magdy, W., Darwish, K. (2006). Word-Based Correction for Retrieval of Arabic OCR Degraded Documents. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_17
DOI: https://doi.org/10.1007/11880561_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)