Abstract
UTA participated in the monolingual Bengali ad hoc Track at FIRE 2010. As Bengali is highly inflectional, we experimented with three language normalizers: one stemmer, YASS, and two lemmatizers, GRALE and StaLe. YASS is a corpus-based, unsupervised statistical stemmer capable of handling several languages through suffix removal. GRALE is a novel graph-based lemmatizer for Bengali that is extendable to other agglutinative languages. StaLe is a statistical rule-based lemmatizer that has been implemented for several languages. We analyze nine runs, applying each of the three systems to title (T), title-and-description (TD), and title-description-and-narrative (TDN) queries. The T runs were the least effective, with MAP about 0.34 (P@10 about 0.30). All the TD runs delivered a MAP close to 0.45 (P@10 about 0.37), while the TDN runs gave a MAP of 0.50 to 0.52 (P@10 about 0.41). The retrieval performance of the three normalizers is similar, but they have different strengths in other aspects. The results compare well with those obtained by other groups in the monolingual Bengali ad hoc Track at FIRE 2010.
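The two effectiveness measures reported above, MAP and P@10, can be illustrated with a minimal Python sketch. It is not the evaluation code used in the experiments: the function names, the document identifiers, and the toy rankings are all hypothetical, and the definitions follow the common convention of dividing P@k by k and averaging precision over all relevant documents for a topic.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranks holding a relevant document.

    Divides by k even when fewer than k documents were retrieved
    (an assumed convention for this sketch).
    """
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Sum of the precision at each relevant hit, divided by the number
    of relevant documents for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of the per-topic average precision over all topics in the run."""
    if not runs:
        return 0.0
    return sum(
        average_precision(ranking, qrels.get(topic, set()))
        for topic, ranking in runs.items()
    ) / len(runs)


# Hypothetical toy data: two topics with ranked result lists and relevance judgments.
toy_run = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}

toy_map = mean_average_precision(toy_run, toy_qrels)
toy_p10 = precision_at_k(toy_run["q1"], toy_qrels["q1"], k=10)
```

In this toy setting, topic q1 has relevant documents at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2, and topic q2 has its single relevant document at rank 2, giving 1/2. MAP is the mean of these two values.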
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Loponen, A., Paik, J.H., Järvelin, K. (2013). UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_25
DOI: https://doi.org/10.1007/978-3-642-40087-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)