Information Retrieval with Hindi, Bengali, and Marathi Languages: Evaluation and Analysis

Savoy, Jacques; Dolamic, Ljiljana; Akasereh, Mitra

doi:10.1007/978-3-642-40087-2_30


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)


Abstract

Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies for corpora written in Hindi, Bengali, and Marathi. As a second goal, we developed new and more aggressive stemming strategies for both Marathi and Hindi during this second campaign, and compared their retrieval effectiveness with both a light stemming strategy and a language-independent n-gram approach. As another language-independent indexing strategy, we evaluated the trunc-n method, in which each indexing term is formed from only the first n letters of a word. To evaluate these solutions, we used various IR models, including models derived from Divergence from Randomness (DFR), the Language Model (LM) approach, and Okapi, as well as the classical tf-idf vector-processing approach.
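To make the two language-independent indexing strategies concrete, here is a minimal Python sketch of trunc-n and overlapping character n-grams. It is an illustration under stated assumptions, not the chapter's implementation: the function names `trunc_n`, `char_ngrams`, and `index_terms` are hypothetical, a "letter" is taken to be one Unicode code point, tokens are split on whitespace, and words shorter than n are kept whole.

```python
from typing import List


def trunc_n(word: str, n: int) -> str:
    """Return the first n characters of a word (the whole word if shorter)."""
    return word[:n]


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a single word.

    Assumption: a word with n or fewer characters is kept as one term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, n: int, strategy: str = "trunc") -> List[str]:
    """Turn whitespace-separated text into indexing terms.

    strategy is "trunc" (one term per word) or "ngram" (several per word).
    """
    terms: List[str] = []
    for word in text.split():
        if strategy == "trunc":
            terms.append(trunc_n(word, n))
        elif strategy == "ngram":
            terms.extend(char_ngrams(word, n))
        else:
            raise ValueError("unknown strategy: " + strategy)
    return terms
```

With trunc-n, "retrieval" and "retrieving" both map to the same 4-letter term, which is how the method conflates word variants without language-specific rules. Note that Python slices strings by code point, so in Devanagari or Bengali script a vowel sign counts as a separate character here; whether the chapter counts letters this way is not stated in the abstract.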

For the three languages studied, our experiments tend to show that IR models derived from the Divergence from Randomness (DFR) paradigm produce the best overall results. Our experiments also demonstrate that either an aggressive stemming procedure or the trunc-n indexing approach yields better retrieval effectiveness than the other word-based or language-independent n-gram approaches. Applying the Z-score as a data fusion operator after blind-query expansion also tends to improve the MAP of the merged run over that of the best single IR system.
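The idea of Z-score data fusion can be sketched as follows: each run's retrieval scores are standardized with that run's own mean and standard deviation, and the standardized scores are summed per document. This is a generic sketch, since the abstract does not give the exact formula used in the chapter. The function name `zscore_fuse`, the use of the population standard deviation, and the choice that a document absent from a run contributes nothing are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def zscore_fuse(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Merge several runs (doc_id -> score) into one ranked list.

    Each score is replaced by (score - run mean) / run standard deviation,
    and the standardized scores are summed over the runs.
    """
    fused: Dict[str, float] = defaultdict(float)
    for run in runs:
        if not run:
            continue  # an empty run has nothing to contribute
        scores = list(run.values())
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            continue  # identical scores carry no ranking information
        for doc_id, score in run.items():
            fused[doc_id] += (score - mu) / sigma
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical runs whose raw scores are on very different scales.
run_a = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
run_b = {"d1": 0.2, "d2": 0.9, "d3": 0.1}
merged = zscore_fuse([run_a, run_b])
```

Standardizing first is what makes the runs comparable: without it, the run with the larger raw score range would dominate the merged ranking.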



Author information

Authors and Affiliations

Computer Science Department, University of Neuchatel, Rue Emile Argand 11, 2000, Neuchatel, Switzerland
Jacques Savoy, Ljiljana Dolamic & Mitra Akasereh


Editor information

Editors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
Indian Institute of Technology, Bombay, India
Pushpak Bhattacharyya
IBM Research New Delhi, India
L. Venkata Subramaniam & Danish Contractor
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso


About this paper

Cite this paper

Savoy, J., Dolamic, L., Akasereh, M. (2013). Information Retrieval with Hindi, Bengali, and Marathi Languages: Evaluation and Analysis. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_30


DOI: https://doi.org/10.1007/978-3-642-40087-2_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)
