Retrieval from OCR Text: RISOT Track

Ghosh, Kripabandhu; Parui, Swapan Kumar

doi:10.1007/978-3-642-40087-2_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)

Abstract

In this paper, we present our work in the RISOT track of FIRE 2011. We describe an error modeling technique for OCR errors in an Indic script. Based on the error model, we apply a two-fold error correction method to the OCRed corpus. First, we correct the corpus using two approaches: correction with full confidence and correction without full confidence. Second, we use query expansion for error correction. Our retrieval results are significantly better than the baseline, and the difference between our best result and the original text run is not significant.
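
As a rough illustration of how query expansion can compensate for OCR errors, the sketch below adds OCR-confusable spellings of each query term when they occur in the corpus vocabulary. It is not the method of this paper, whose details are not given in the abstract. The confusion table, the function names `ocr_variants` and `expand_query`, and the Latin-script sample vocabulary are all hypothetical stand-ins chosen for readability.

```python
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# true character -> strings an OCR engine might emit in its place.
CONFUSIONS = {
    "m": ["rn"],
    "l": ["1", "i"],
    "o": ["0"],
}

def ocr_variants(term, confusions):
    """Return all spellings of `term` reachable by swapping characters for their confusions."""
    # For each character, the choices are the character itself plus its confusions.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    return {"".join(choice) for choice in product(*options)}

def expand_query(terms, vocabulary, confusions=CONFUSIONS):
    """Map each query term to itself plus the variants found in the OCR vocabulary."""
    expanded = {}
    for term in terms:
        # Keep only variants that actually occur in the OCRed corpus.
        variants = ocr_variants(term, confusions) & vocabulary
        expanded[term] = sorted(variants | {term})
    return expanded

# Toy vocabulary standing in for the set of terms indexed from an OCRed corpus.
ocr_vocabulary = {"rnodel", "mode1", "error", "err0r", "text"}
print(expand_query(["model", "error"], ocr_vocabulary))
```

Filtering the variants against the corpus vocabulary keeps the expanded query small, since exhaustive substitution grows multiplicatively with term length.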

Author information

Authors and Affiliations

Indian Statistical Institute, Kolkata, West Bengal, India
Kripabandhu Ghosh & Swapan Kumar Parui

Editor information

Editors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
Indian Institute of Technology, Bombay, India
Pushpak Bhattacharyya
IBM Research New Delhi, India
L. Venkata Subramaniam & Danish Contractor
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso

About this paper

Cite this paper

Ghosh, K., Parui, S.K. (2013). Retrieval from OCR Text: RISOT Track. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_21

DOI: https://doi.org/10.1007/978-3-642-40087-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)