UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task

Loponen, Aki; Paik, Jiaul H.; Järvelin, Kalervo

doi:10.1007/978-3-642-40087-2_25

UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task

Aki Loponen^21,,
Jiaul H. Paik^21, &
Kalervo Järvelin^21,

Conference paper

674 Accesses

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7536))

Abstract

UTA participated in the monolingual Bengali ad hoc Track at FIRE 2010. As Bengali is highly inflectional, we experimented with three language normalizers: one stemmer, YASS, and two lemmatizers, GRALE and StaLe. YASS is a corpus-based unsupervised statistical stemmer capable of handling several languages through suffix removal. GRALE is a novel graph-based lemmatizer for Bengali, but extendable for other agglutinative languages. StaLe is a statistical rule-based lemmatizer that has been implemented for several languages. We analyze 9 runs, using the three systems for the title (T) and title-and-description (TD) and title-description-and-narrative (TDN). The T runs were the least effective with MAP about 0.34 (P@10 about 0.30). All the TD runs delivered a MAP close to 0.45 (P@10 about 0.37), while the TDN runs gave a MAP of 0.50 to 0.52 (P@10 about 0.41). The performances of the three normalizers are close to each other, but they have different strengths in other aspects. The performances compare well with the ones other groups obtained in the monolingual Bengali ad hoc Track at FIRE 2010.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Airio, E.: Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9, 249–271 (2006)
Article Google Scholar
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley (2011)
Google Scholar
Kettunen, K.: Reductive and Generative Approaches to Morphological Variation of Keywords in Monolingual Information Retrieval. Acta Universitatis Tamperensis 1261. University of Tampere, Tampere (2007)
Google Scholar
Koskenniemi, K.: Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. Thesis, University of Helsinki, Department of General Linguistics, Helsinki (1983)
Google Scholar
Krovetz, R.: Viewing morphology as an inference process. In: Proceedings of the 16th ACM/SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, Pennsylvania, USA, pp. 191–202 (1993)
Google Scholar
Lemur: The Lemur Tool-kit for Language Modelling and Information Retrieval, http://www.lemurproject.org/ (visited March 30, 2010)
Lindén, K.: A Probabilistic Model for Guessing Base Forms of New Words by Analogy. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 106–116. Springer, Heidelberg (2008)
Chapter Google Scholar
Loponen, A., Järvelin, K.: A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds.) CLEF 2010. LNCS, vol. 6360, pp. 3–14. Springer, Heidelberg (2010)
Chapter Google Scholar
Losee, R.M.: Is 1 Noun Worth 2 Adjectives? Measuring Relative Feature Utility. Information Processing and Management 42(5), 1248–1259 (2006)
Article Google Scholar
Majumder, P., Mitra, M., Parui, S.K., Kole, G., Mitra, P., Datta, K.: YASS: Yet Another Suffix Stripper. ACM Transactions on Information Systems (TOIS) 25(4) (2007)
Google Scholar
Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K., Järvelin, K.: Fuzzy translation of cross-lingual spelling variants. In: SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 345–353 (2003)
Google Scholar
Plisson, J., Lavrac, N., Mladenic, D.: A rule based approach to word lemmatization. In: Proceedings of the 7th International Multi-Conference Information Society IS 2004, pp. 83–86 (2004)
Google Scholar
Wicentowski, R.: Modelling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. Ph.D. Thesis, Baltimore, Maryland, USA (2002)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Information Sciences, University of Tampere, Finland
Aki Loponen, Jiaul H. Paik & Kalervo Järvelin

Authors

Aki Loponen
View author publications
You can also search for this author in PubMed Google Scholar
Jiaul H. Paik
View author publications
You can also search for this author in PubMed Google Scholar
Kalervo Järvelin
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
Indian Institutte of Technology, Bombay, India
Pushpak Bhattacharyya
IBM Research New Delhi, India
L. Venkata Subramaniam & Danish Contractor &
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Loponen, A., Paik, J.H., Järvelin, K. (2013). UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_25

Download citation

DOI: https://doi.org/10.1007/978-3-642-40087-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics