
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Authors: Ljiljana Dolamic (University of Neuchatel), Jacques Savoy (University of Neuchatel)

ACM Transactions on Asian Language Information Processing, Volume 9, Issue 3, Article No. 11, pp. 1–24. https://doi.org/10.1145/1838745.1838748

Published: 01 September 2010


Abstract

The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world’s 20 most spoken languages and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we propose both a light and a more aggressive stemming approach based on them.

In our evaluation of these stemming strategies we use the FIRE 2008 test collections. To broaden our comparisons, we also implement and evaluate two language-independent indexing methods: n-gram and trunc-n (truncation of each word to its first n letters). We evaluate these solutions with various IR models, including Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf idf and Lnu-ltc.
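As a rough illustration of the two language-independent schemes named above, the following Python sketch derives indexing terms from a word by truncation and by overlapping character n-grams. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, n-grams are generated within each word only, words shorter than n are kept whole, and a "letter" is taken to be one Unicode code point, which may differ from how letters are counted for Devanagari or Bengali script in the article.

```python
from typing import List


def trunc_n(word: str, n: int = 4) -> str:
    """Keep only the first n characters of a word (shorter words are unchanged)."""
    return word[:n]


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of a single word.

    Assumption of this sketch: a word with at most n characters
    is returned whole as its only indexing term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, scheme: str = "trunc", n: int = 4) -> List[str]:
    """Turn whitespace-separated text into indexing terms under one scheme."""
    terms: List[str] = []
    for word in text.split():
        if scheme == "trunc":
            terms.append(trunc_n(word, n))
        elif scheme == "ngram":
            terms.extend(char_ngrams(word, n))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return terms
```

With n = 4, the English word "computing" yields the single term "comp" under truncation and six terms ("comp", "ompu", "mput", "puti", "utin", "ting") under the n-gram scheme, so an n-gram index holds several times more postings per word than a trunc-n index.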

Experiments performed with all three languages demonstrate that the I(n_e)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to those involving a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. Based on a comparison of word-based and language-independent approaches we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
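The evaluation measures mentioned above can be sketched as follows. This is a minimal Python sketch of the standard (non-interpolated) average precision, its mean over queries (MAP), and a relative improvement over a baseline. The function names and the data layout (a dict of ranked document ids per query and a dict of relevant ids per query) are assumptions made for illustration, and each ranking is assumed to contain no duplicate document ids.

```python
from typing import Dict, Iterable, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Iterable[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries.

    A query with no returned ranking contributes an average precision of 0.
    """
    scores = [average_precision(runs.get(q, []), qrels[q]) for q in qrels]
    return sum(scores) / len(scores)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` with respect to `baseline`."""
    return 100.0 * (improved - baseline) / baseline
```

As a usage example with made-up values (not taken from the article), a MAP rising from 0.20 without stemming to 0.25 with stemming corresponds to `relative_improvement(0.20, 0.25)`, a relative improvement of about 25%.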


Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information storage systems

Published in

ACM Transactions on Asian Language Information Processing Volume 9, Issue 3
September 2010
82 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/1838745

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 2010
- Accepted: 1 April 2010
- Revised: 1 March 2010
- Received: 1 September 2009
Author Tags
Bengali language
Hindi language
Indic languages
Marathi language
natural language processing with Indo-European languages
search engines for Asian languages
stemmer
Qualifiers
- research-article
- Research
- Refereed