Abstract
This paper presents a new model that improves the retrieval effectiveness of OCR-degraded Arabic text. The proposed model is based on simulating Arabic OCR recognition errors using a word-based approach. The model then expands the user's search query with the expected OCR errors. The expanded query yields higher precision and recall when searching OCR-degraded Arabic text than the original query. The new model shows a significant increase in degraded-text retrieval effectiveness over previous models: its retrieval effectiveness is 97%, whereas the best published effectiveness for a word-based approach was 84% and the best for a character-based approach was 56%. In addition, the new model overcomes several limitations of the two existing models.
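The query-expansion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_error_model`, `expand_query`, `to_boolean_string`), the `max_variants` cutoff, the toy word pairs, and the AND/OR query syntax are all assumptions made for illustration. The sketch assumes a training set of aligned (clean word, OCR output) pairs is available, counts the OCR outputs seen for each clean word, and expands each query term with its most frequent erroneous variants.

```python
from collections import Counter, defaultdict


def build_error_model(aligned_pairs):
    """Count, for each clean word, the OCR outputs observed for it.

    aligned_pairs: iterable of (clean_word, ocr_word) tuples (assumed input).
    """
    model = defaultdict(Counter)
    for clean_word, ocr_word in aligned_pairs:
        model[clean_word][ocr_word] += 1
    return model


def expand_query(query, model, max_variants=3):
    """Return one list per query term: the term itself followed by up to
    max_variants of its most frequently observed OCR misrecognitions."""
    expanded = []
    for term in query.split():
        counts = model.get(term, Counter())
        ranked = [w for w, _ in counts.most_common() if w != term]
        expanded.append([term] + ranked[:max_variants])
    return expanded


def to_boolean_string(expanded):
    """Render the expansion as OR-groups joined by AND (hypothetical syntax)."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in expanded)


# Toy training data (invented for illustration): clean word, OCR output.
pairs = [
    ("كتاب", "كتاب"),
    ("كتاب", "كناب"),
    ("كتاب", "كناب"),
    ("كتاب", "كتات"),
    ("مكتبة", "مكتية"),
]
model = build_error_model(pairs)
expanded = expand_query("كتاب مكتبة", model)
print(to_boolean_string(expanded))
```

Each query term is kept as the first alternative, so a document whose OCR output happens to be correct still matches; terms never seen in the training pairs pass through unexpanded.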
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ezzat, M., ElGhazaly, T., Gheith, M. (2013). An Enhanced Arabic OCR Degraded Text Retrieval Model. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_31
DOI: https://doi.org/10.1007/978-3-642-45114-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45113-3
Online ISBN: 978-3-642-45114-0
eBook Packages: Computer Science (R0)