UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)

Abstract

UTA participated in the monolingual Bengali ad hoc track at FIRE 2010. As Bengali is highly inflectional, we experimented with three language normalizers: one stemmer, YASS, and two lemmatizers, GRALE and StaLe. YASS is a corpus-based, unsupervised statistical stemmer that handles several languages through suffix removal. GRALE is a novel graph-based lemmatizer for Bengali that can be extended to other agglutinative languages. StaLe is a statistical rule-based lemmatizer that has been implemented for several languages. We analyze nine runs, applying each of the three systems to title (T), title-and-description (TD), and title-description-and-narrative (TDN) queries. The T runs were the least effective, with a MAP of about 0.34 (P@10 about 0.30). All the TD runs delivered a MAP close to 0.45 (P@10 about 0.37), while the TDN runs gave a MAP of 0.50 to 0.52 (P@10 about 0.41). The three normalizers perform similarly in retrieval effectiveness, but each has different strengths in other respects. These results compare well with those obtained by other groups in the monolingual Bengali ad hoc track at FIRE 2010.
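
For readers unfamiliar with the two measures reported above, the following is a minimal sketch of how MAP and P@10 are conventionally computed from a ranked result list and a set of relevance judgments. It is not the evaluation code used in the experiments; the function names and the toy document identifiers are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    The denominator is always k, even if fewer than k documents were retrieved.
    """
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-topic average precision values."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical document identifiers (not data from the paper).
toy_ranking = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3"}
toy_ap = average_precision(toy_ranking, toy_relevant)
toy_p2 = precision_at_k(toy_ranking, toy_relevant, k=2)
```

In the toy example, relevant documents appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, and precision at cutoff 2 is 1/2.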

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Loponen, A., Paik, J.H., Järvelin, K. (2013). UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_25

  • DOI: https://doi.org/10.1007/978-3-642-40087-2_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40086-5

  • Online ISBN: 978-3-642-40087-2

  • eBook Packages: Computer Science, Computer Science (R0)
