Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7536))

Abstract

In this paper, we present our work in the RISOT track of FIRE 2011. We describe a technique for modeling OCR errors in an Indic script. Based on this error model, we apply a two-fold error correction method to the OCRed corpus. First, we correct the corpus itself using two approaches: correction with full confidence and correction without full confidence. Second, we use query expansion for error correction. Our retrieval results are significantly better than the baseline, and the difference between our best result and the original text run is not significant.
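The abstract does not say how query expansion is applied. As a rough illustration only, and assuming a character confusion table derived from an error model, a query term could be expanded into its likely OCR-corrupted variants so that it also matches misrecognized words in the corpus. The table, the function names, and the Latin-script examples below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical confusion table: correct character -> strings an OCR engine
# might output instead. These pairs are illustrative, not from the paper.
CONFUSIONS: Dict[str, Set[str]] = {
    "l": {"1", "i"},
    "o": {"0"},
    "m": {"rn"},
}


def expand_term(term: str, confusions: Dict[str, Set[str]]) -> List[str]:
    """Return the term plus every variant with exactly one confused character."""
    variants = {term}
    for i, ch in enumerate(term):
        for wrong in confusions.get(ch, ()):
            # Substitute the character at position i with its confused form.
            variants.add(term[:i] + wrong + term[i + 1:])
    return sorted(variants)


def expand_query(query: str, confusions: Dict[str, Set[str]]) -> List[str]:
    """Expand each whitespace-separated query term and concatenate the results."""
    expanded: List[str] = []
    for term in query.split():
        expanded.extend(expand_term(term, confusions))
    return expanded
```

Calling `expand_query("lo m", CONFUSIONS)` yields the original terms together with their single-substitution variants. A real system would need a confusion table estimated from aligned clean and OCRed text, and probably weights on the expanded terms; neither is specified in the abstract.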

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ghosh, K., Parui, S.K. (2013). Retrieval from OCR Text: RISOT Track. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_21

  • DOI: https://doi.org/10.1007/978-3-642-40087-2_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40086-5

  • Online ISBN: 978-3-642-40087-2

  • eBook Packages: Computer Science, Computer Science (R0)
