Abstract
This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Because we had access to the internal representation produced during the recognition stage, we were able to generate four automatic transcriptions, each introducing some form of spelling correction in an attempt to improve retrieval effectiveness. We evaluated the retrieval effectiveness of each of these versions using three text representations combined with five IR models, three stemming strategies, and two query formulations. A manually transcribed, error-free version served as the ground truth. Based on our experiments, we conclude that taking into account only the single best recognition word, or all possible top-k recognition alternatives, does not yield the best performance. Selecting all words whose log-likelihood is close to that of the best alternative produces the best text surrogate. With this representation, the different retrieval strategies tend to produce similar performance levels.
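The alternative-selection strategy the abstract identifies as best can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the threshold value, and the example hypotheses are all assumptions made for demonstration.

```python
def select_alternatives(hypotheses, delta=1.0):
    """Keep every recognition alternative whose log-likelihood lies
    within `delta` of the best hypothesis.

    `hypotheses` is a list of (word, log_likelihood) pairs for one
    word position; `delta` is an illustrative threshold.
    """
    best = max(ll for _, ll in hypotheses)
    return [word for word, ll in hypotheses if best - ll <= delta]

# Hypothetical recognition output for a single word position:
hyps = [("lord", -0.2), ("lore", -0.9), ("word", -3.5)]
print(select_alternatives(hyps))  # near-best words kept, distant one dropped
```

Compared with indexing only the single best word, this keeps plausible near-misses in the text surrogate; compared with a fixed top-k, it adapts the number of kept alternatives to how confident the recognizer is at each position.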
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Naji, N., Savoy, J. (2011). Information Retrieval Strategies for Digitized Handwritten Medieval Documents. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_10
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science (R0)