Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

  • Conference paper
String Processing and Information Retrieval (SPIRE 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Abstract

Many Arabic documents remain available only in print; scanning them and applying OCR makes them easier to retrieve. This paper explores how word-based OCR correction affects the effectiveness of retrieving Arabic OCR’ed documents using different index terms. The correction uses an improved character-segment-based noisy channel model and is tested on both real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used, and that indexing with short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
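
To make the idea of noisy channel correction concrete, the following is a minimal sketch and not the paper's model. It picks the vocabulary word w that maximizes P(w) * P(observed | w). Everything in it is an assumption made for illustration: the toy vocabulary, the probabilities, the names WORD_PROB, CONFUSION, channel_log_prob and correct, the use of Latin letters as stand-ins for Arabic characters, and the restriction to single-character substitutions (the paper's model works on character segments, which this sketch does not attempt).

```python
import math

# Hypothetical unigram language model P(w) over a toy vocabulary.
WORD_PROB = {"cat": 0.6, "cot": 0.3, "car": 0.1}

# Hypothetical channel model: P(observed_char | true_char) for known confusions,
# keyed as (true_char, observed_char).
CONFUSION = {("a", "e"): 0.1, ("t", "r"): 0.05}
P_CORRECT = 0.85  # assumed probability that a character is recognized correctly
P_OTHER = 0.001   # assumed floor for confusions not listed above


def channel_log_prob(observed, candidate):
    """Log P(observed | candidate), allowing only 1-to-1 character substitutions."""
    if len(observed) != len(candidate):
        return float("-inf")  # insertions and deletions are out of scope here
    total = 0.0
    for true_ch, obs_ch in zip(candidate, observed):
        if true_ch == obs_ch:
            p = P_CORRECT
        else:
            p = CONFUSION.get((true_ch, obs_ch), P_OTHER)
        total += math.log(p)
    return total


def correct(observed):
    """Return the vocabulary word with the highest log P(w) + log P(observed | w)."""
    best_word, best_score = observed, float("-inf")
    for word, prob in WORD_PROB.items():
        score = math.log(prob) + channel_log_prob(observed, word)
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # falls back to the input if no candidate can explain it


if __name__ == "__main__":
    for token in ["cet", "cot", "xy"]:
        print(token, "->", correct(token))
```

In this toy setup a token that already matches a vocabulary word is left alone, a token one plausible confusion away from a frequent word is mapped to it, and a token that no candidate can explain is returned unchanged.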

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Magdy, W., Darwish, K. (2006). Word-Based Correction for Retrieval of Arabic OCR Degraded Documents. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_17

  • DOI: https://doi.org/10.1007/11880561_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45774-9

  • Online ISBN: 978-3-540-45775-6

  • eBook Packages: Computer Science, Computer Science (R0)
