An Enhanced Arabic OCR Degraded Text Retrieval Model

Ezzat, Mostafa; ElGhazaly, Tarek; Gheith, Mervat

doi:10.1007/978-3-642-45114-0_31


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8265)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence


Abstract

This paper presents a new model that improves the retrieval effectiveness of Arabic OCR-degraded text. The proposed model is based on simulating Arabic OCR recognition mistakes at the word level. The model then expands the user's search query with the expected OCR errors. The resulting expanded query gives higher precision and recall than the original query when searching Arabic OCR-degraded text. The new model shows a significant increase in degraded-text retrieval effectiveness over previous models: its retrieval effectiveness is 97%, while the best published effectiveness was 84% for a word-based approach and 56% for a character-based approach. In addition, the new model overcomes several limitations of the two existing models.
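
The query expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_error_model`, `expand_query`, `matches`), the training pairs, and the specific character confusions are all hypothetical, and a real system would learn its error model from a large aligned corpus of clean and OCR-degraded text.

```python
from collections import defaultdict

def build_error_model(aligned_pairs):
    """Map each clean word to the OCR-degraded forms observed for it.

    aligned_pairs is assumed to be (clean_word, ocr_word) tuples taken
    from a word-aligned training corpus (hypothetical data here).
    """
    model = defaultdict(set)
    for clean, ocr in aligned_pairs:
        if clean != ocr:  # only record genuine recognition mistakes
            model[clean].add(ocr)
    return model

def expand_query(query, model):
    """Return one group per query term: the term plus its expected OCR variants."""
    return [[term] + sorted(model.get(term, ())) for term in query.split()]

def matches(document_tokens, expanded):
    """A document matches if every term group has at least one form in it."""
    tokens = set(document_tokens)
    return all(any(form in tokens for form in group) for group in expanded)

# Illustrative training pairs (assumed, not taken from the paper).
pairs = [("كتاب", "كناب"), ("كتاب", "كتاب"), ("مكتبة", "مكنبة")]
model = build_error_model(pairs)

query = "كتاب مكتبة"
expanded = expand_query(query, model)
degraded_doc = "كناب في مكنبة".split()

print(matches(degraded_doc, expand_query(query, {})))  # unexpanded query
print(matches(degraded_doc, expanded))                 # expanded query
```

In this toy setting the unexpanded query misses the degraded document because neither clean word survives OCR, while the expanded query finds it through the simulated error variants. Treating the variants of one term as alternatives (a logical OR within each group) is one simple design choice; a weighted scheme based on error frequency would be another.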



Author information

Authors and Affiliations

Computer Sciences Department, Institute of Statistical Studies & Research, Cairo University, Egypt
Mostafa Ezzat, Tarek ElGhazaly & Mervat Gheith


Editor information

Editors and Affiliations

Universidad Autónoma del Estado de Hidalgo, Ciudad Universitaria, Carretera Pachuca–Tulancingo km 4.5, Hidalgo, Mexico
Félix Castro
Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan Dios Bátiz s/n, Col. Nueva Industrial Vallejo, 07738, Mexico City, Mexico
Alexander Gelbukh
Tecnológico de Monterrey, Campus Estado de México, Carretera Lago de Guadalupe Km 3.5, Atizapán de Zaragoza, CP 52926, Estado de México, Mexico
Miguel González

About this paper

Cite this paper

Ezzat, M., ElGhazaly, T., Gheith, M. (2013). An Enhanced Arabic OCR Degraded Text Retrieval Model. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_31

DOI: https://doi.org/10.1007/978-3-642-45114-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45113-3
Online ISBN: 978-3-642-45114-0
eBook Packages: Computer Science, Computer Science (R0)