Abstract
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages rank among the world’s 20 most spoken languages, and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we propose both a light and a more aggressive stemming approach based on them.
We evaluate these stemming strategies on the FIRE 2008 test collections. To broaden our comparisons, we also implement and evaluate two language-independent indexing methods: n-gram and trunc-n (retaining only the first n letters of each word). We evaluate these solutions with various IR models, including Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf-idf and Lnu-ltc.
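The two language-independent schemes can be illustrated with a minimal Python sketch. The function names, the whitespace tokenization, and the choice to keep short words whole are assumptions made for illustration only, not the authors' implementation. The sketch also slices by Unicode code point, which may not match what counts as a "letter" in Devanagari or Bengali script, so an English word is used as the example.

```python
def trunc_n(word, n=4):
    """trunc-n: keep only the first n characters of a word."""
    return word[:n]


def char_ngrams(word, n=4):
    """Overlapping character n-grams taken inside a single word.

    Assumption: words no longer than n are kept as one indexing unit.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, scheme="trunc", n=4):
    """Turn a text into indexing units under the chosen scheme.

    Assumption: tokens are separated by whitespace.
    """
    tokens = text.split()
    if scheme == "trunc":
        return [trunc_n(t, n) for t in tokens]
    if scheme == "ngram":
        return [g for t in tokens for g in char_ngrams(t, n)]
    raise ValueError(f"unknown scheme: {scheme}")


print(trunc_n("computing"))      # comp
print(char_ngrams("computing"))  # ['comp', 'ompu', 'mput', 'puti', 'utin', 'ting']
```

Under trunc-4 each word contributes a single indexing unit, whereas the 4-gram scheme produces several overlapping units per word, which enlarges the index.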
Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our tests suggest that more aggressive stemmers, especially those accounting for certain derivational suffixes, yield better retrieval effectiveness than either a light stemmer or no stemming at all. Comparisons between the no-stemming and stemming indexing schemes show that the performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements over a no-stemming approach are ~28% for Hindi, ~42% for Marathi, and ~18% for Bengali. Comparing word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than those of the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
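For readers unfamiliar with the measures, the following Python sketch shows the standard way MAP and a relative improvement are computed. The function names, document identifiers, and MAP values are hypothetical and chosen only to make the arithmetic visible; they are not results from the article.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list given the set of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_improvement(baseline, improved):
    """Percentage change of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical single query: relevant documents retrieved at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(ap, 4))  # 0.8333

# Hypothetical MAP values: 0.25 without stemming, 0.32 with stemming.
print(round(relative_improvement(0.25, 0.32), 1))  # 28.0
```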
Index Terms
- Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages