An Enhanced Arabic OCR Degraded Text Retrieval Model

  • Conference paper
Advances in Artificial Intelligence and Its Applications (MICAI 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8265)

Abstract

This paper presents a new model that improves the retrieval effectiveness of Arabic OCR-degraded text. The proposed model is based on simulating Arabic OCR recognition mistakes using a word-based approach. The model then expands the user's search query with the expected OCR errors. The resulting expanded query gives higher precision and recall when searching Arabic OCR-degraded text than the original query. The new model shows a significant increase in degraded text retrieval effectiveness over previous models: its retrieval effectiveness is 97%, while the best published effectiveness for a word-based approach was 84% and the best for a character-based approach was 56%. In addition, the new model overcomes several limitations of the two existing models.
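
The query expansion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion table, its entries (Latin placeholders standing in for Arabic words), and the function names `expand_term`, `expand_query`, and `matches` are all assumptions made for illustration. The sketch only shows the general idea of replacing each query word with the set of forms an OCR engine is expected to produce for it.

```python
from typing import Dict, List

# Hypothetical word-level OCR error model: each clean word maps to the degraded
# forms an OCR engine is assumed to produce for it. The entries are invented
# Latin placeholders, not data from the paper.
OCR_ERROR_MODEL: Dict[str, List[str]] = {
    "kitab": ["kilab", "kitah"],
    "madina": ["rnadina"],
}


def expand_term(term: str, model: Dict[str, List[str]]) -> List[str]:
    """Return the term followed by its expected OCR variants, without duplicates."""
    variants = [term]
    for variant in model.get(term, []):
        if variant not in variants:
            variants.append(variant)
    return variants


def expand_query(query: str, model: Dict[str, List[str]]) -> List[List[str]]:
    """Expand every whitespace-separated query word into a group of alternatives."""
    return [expand_term(term, model) for term in query.split()]


def matches(document: str, expanded: List[List[str]]) -> bool:
    """True if the document contains at least one alternative from every group."""
    tokens = set(document.split())
    return all(any(v in tokens for v in group) for group in expanded)


if __name__ == "__main__":
    expanded = expand_query("kitab madina", OCR_ERROR_MODEL)
    # A degraded document that the unexpanded query would miss.
    print(matches("kilab rnadina", expanded))
```

Treating each group as a set of alternatives (OR within a group, AND across groups) is one simple design choice; a ranked retrieval system would more likely weight each variant by how probable the corresponding OCR error is.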

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ezzat, M., ElGhazaly, T., Gheith, M. (2013). An Enhanced Arabic OCR Degraded Text Retrieval Model. In: Castro, F., Gelbukh, A., González, M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_31

  • DOI: https://doi.org/10.1007/978-3-642-45114-0_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45113-3

  • Online ISBN: 978-3-642-45114-0

  • eBook Packages: Computer Science, Computer Science (R0)
