research-article

Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Published: 01 September 2010

Abstract

The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages rank among the world’s 20 most spoken languages and share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we propose both a light and a more aggressive stemming approach based on them.

In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections. To broaden our comparisons, we also implement and evaluate two language-independent indexing methods: the n-gram and trunc-n (truncation of each word to its first n letters). We evaluate these solutions with various IR models, including Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf-idf and Lnu-ltc.
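The two language-independent indexing schemes named above can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not confirm: n-grams are generated within each word as overlapping windows, words no longer than n are kept whole, tokens are split on whitespace, and each Unicode code point counts as one "letter" (for Devanagari or Bengali script a real system might count grapheme clusters instead). The function names are hypothetical and are not taken from the article.

```python
from typing import List


def trunc_n(word: str, n: int = 4) -> str:
    """Keep only the first n characters of a word (trunc-n)."""
    return word[:n]


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return overlapping character n-grams taken inside a single word.

    Assumption: words of length <= n are indexed unchanged.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, scheme: str = "trunc", n: int = 4) -> List[str]:
    """Turn whitespace-separated text into indexing terms."""
    terms: List[str] = []
    for word in text.split():
        if scheme == "trunc":
            terms.append(trunc_n(word, n))      # one term per word
        elif scheme == "ngram":
            terms.extend(char_ngrams(word, n))  # several terms per word
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return terms
```

Under these assumptions, trunc-4 yields exactly one indexing term per word, whereas 4-gram indexing yields several overlapping terms for every word longer than four characters.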

Experiments performed with all three languages demonstrate that the I(ne)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that retrieval effectiveness improves when more aggressive stemmers are applied, especially those accounting for certain derivational suffixes, compared to using a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. Based on a comparison of word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than those of the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
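For readers unfamiliar with the measures quoted above, the following Python sketch shows the standard definitions of average precision, MAP, and relative improvement over a baseline. The function names and the input values are hypothetical illustrations; none of the numbers are taken from the article's experiments.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


def relative_improvement(baseline: float, system: float) -> float:
    """Percentage change of a system's score relative to a baseline score."""
    return (system - baseline) / baseline * 100.0
```

With made-up values, a baseline MAP of 0.25 and a stemmed-run MAP of 0.32 would correspond to a relative improvement of about 28%.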



      • Published in

        ACM Transactions on Asian Language Information Processing, Volume 9, Issue 3
        September 2010
        82 pages
        ISSN: 1530-0226
        EISSN: 1558-3430
        DOI: 10.1145/1838745

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 September 2010
        • Accepted: 1 April 2010
        • Revised: 1 March 2010
        • Received: 1 September 2009

        Qualifiers

        • research-article
        • Research
        • Refereed
