Abstract
Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High quality stemming is difficult in highly inflectional Indic languages. Little research has been performed on designing algorithms for stemming of texts in Indic languages. In this study, we focus on the problem of stemming texts in Assamese, a low resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese due to the common appearance of single letter suffixes as morphological inflections. More than 50% of the inflections in Assamese appear as single letter suffixes. Such single letter morphological inflections cause ambiguity when predicting underlying root word. Therefore, we propose a new method that combines a rule based algorithm for predicting multiple letter suffixes and an HMM based algorithm for predicting the single letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Ramanathan, A., Rao, D.: A lightweight stemmer for Hindi. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computatinal Linguistics for South Asian Languages, Budapest, pp. 43–48 (2003)
Majumder, P., Mitra, M., Parui, S.K., Kole, G., Mitra, P., Datta, K.: Yass: Yet another suffix stripper. ACM Trans. Inf. Syst. 25(4) (October 2007)
Pandey, A.K., Siddiqui, T.J.: An unsupervised Hindi stemmer with heuristic improvements. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, AND 2008, Singapore, pp. 99–105 (2008)
Aswani, N., Gaizauskas, R.: Developing morphological analysers for South Asian Languages: Experimenting with the Hindi and Gujarati languages. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC), Malta, pp. 811–815 (2010)
Kumar, D., Rana, P.: Design and development of a stemmer for Punjabi. International Journal of Computer Applications 11(12), 18–23 (2010)
Majgaonker, M.M., Siddiqui, T.J.: Discovering suffixes: A case study for Marathi language. International Journal on Computer Science and Engineering 04, 2716–2720 (2010)
Sharma, U., Kalita, J., Das, R.: Unsupervised learning of morphology for building lexicon for a highly inflectional language. In: Proceedings of the ACL 2002 Workshop on Morphological and Phonological Learning, Philadelphia, pp. 1–6 (2002)
Sharma, U., Kalita, J., Das, R.: Root word stemming by multiple evidence from corpus. In: Proceedings of 6th International Conference on Computational Intelligence and Natural Computing (CINC 2003), North Carolina, pp. 1593–1596 (2003)
Sharma, U., Kalita, J.K., Das, R.K.: Acquisition of morphology of an indic language from text corpus. ACM Transactions of Asian Language Information Processing (TALIP) 7(3), 9:1–9:33 (2008)
Saharia, N., Sharma, U., Kalita, J.: Analysis and evaluation of stemming algorithms: a case study with Assamese. In: Proceedings of the International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, Chennai, India, pp. 842–846. ACM (2012)
Saharia, N., Sharma, U., Kalita, J.: A suffix-based noun and verb classifier for an inflectional language. In: Proceedings of the 2010 International Conference on Asian Language Processing, IALP 2010, Harbin, China, pp. 19–22. IEEE Computer Society (2010)
Al-Shammari, E.T., Lin, J.: Towards an error-free Arabic stemming. In: Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, iNEWS 2008, pp. 9–16. ACM, New York (2008)
Gaustad, T., Bouma, G.: Accurate stemming of Dutch for text classification. Language and Computers 14, 104–117 (2002)
Suba, K., Jiandani, D., Bhattacharyya, P.: Hybrid inflectional stemmer and rule-based derivational stemmer for Gujrati. In: 2nd Workshop on South and Southeast Asian Natural Languages Processing, Chiang Mai, Thailand (2011)
Ram, V.S., Devi, S.L.: Malayalam stemmer. In: Parakh, M. (ed.) Morphological Analysers and Generators, LDC-IL, Mysore, pp. 105–113 (2010)
Bora, L.S.: Asamiya Bhasar Ruptattva. M/s Banalata, Guwahati, Assam, India (2006)
Creutz, M., Lagus, K.: Induction of a simple morphology for highly-inflecting languages. In: Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology, SIGMorPhon 2004, Barcelona, Spain, pp. 43–51. ACL (2004)
Frakes, W.B., Fox, C.J.: Strength and similarity of affix removal stemming algorithms. SIGIR Forum 37(1), 26–30 (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Saharia, N., Konwar, K.M., Sharma, U., Kalita, J.K. (2013). An Improved Stemming Approach Using HMM for a Highly Inflectional Language. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_14
Download citation
DOI: https://doi.org/10.1007/978-3-642-37247-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer ScienceComputer Science (R0)