An Improved Stemming Approach Using HMM for a Highly Inflectional Language

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High-quality stemming is difficult in highly inflectional Indic languages, and little research has been done on stemming algorithms for them. In this study, we focus on stemming texts in Assamese, a low-resource Indic language spoken in the north-eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. We therefore propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes with an HMM-based algorithm for predicting single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
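
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the HMM's states, observations, or parameters, so everything below is an assumption made for illustration. The suffix lists, the lexicon, the romanized toy words, the two hidden states (ROOT, INFL), the three observation symbols, and all probabilities are hypothetical and hand-set rather than learned from a corpus. Stage one strips multiple-letter suffixes by rule; stage two runs Viterbi decoding over the sentence to decide, for each word, whether its final letter is a single-letter suffix.

```python
import math

# All data below is hypothetical and illustrative; none of it comes from the paper.
MULTI_SUFFIXES = ["bor", "khon", "bilak"]  # multiple-letter suffixes, handled by rules
SINGLE_SUFFIXES = {"e", "r", "t"}          # single-letter candidates, handled by the HMM
LEXICON = {"manuh", "kitap", "ghar"}       # known roots, used only to build observations

# Assumed hidden states: ROOT = final letter belongs to the root,
# INFL = final letter is an inflectional suffix. Probabilities are hand-set.
STATES = ("ROOT", "INFL")
START = {"ROOT": 0.6, "INFL": 0.4}
TRANS = {"ROOT": {"ROOT": 0.6, "INFL": 0.4},
         "INFL": {"ROOT": 0.7, "INFL": 0.3}}
EMIT = {"ROOT": {"stem_known": 0.10, "stem_unknown": 0.40, "no_candidate": 0.50},
        "INFL": {"stem_known": 0.69, "stem_unknown": 0.30, "no_candidate": 0.01}}


def strip_multi(word):
    """Rule-based stage: remove the longest matching multiple-letter suffix."""
    for suffix in sorted(MULTI_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word


def observe(word):
    """Map a word to one of three observation symbols for the HMM."""
    if len(word) < 3 or word[-1] not in SINGLE_SUFFIXES:
        return "no_candidate"
    return "stem_known" if word[:-1] in LEXICON else "stem_unknown"


def viterbi(observations):
    """Return the most probable state sequence, computed in log space."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
               for s in STATES}]
    backpointers = []
    for obs in observations[1:]:
        prev, row, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(row)
        backpointers.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def stem_sentence(words):
    """Apply the rule-based stage, then the HMM stage, to a list of words."""
    if not words:
        return []
    partial = [strip_multi(w) for w in words]
    observations = [observe(w) for w in partial]
    tags = viterbi(observations)
    # Strip the last letter only where the HMM says INFL and a candidate exists.
    return [w[:-1] if tag == "INFL" and obs != "no_candidate" else w
            for w, tag, obs in zip(partial, tags, observations)]


print(stem_sentence(["manuhbor", "kitapr", "ghar", "dekha"]))
```

With these toy parameters the call prints ['manuh', 'kitap', 'ghar', 'dekha']: the rule stage removes "bor", the HMM stage removes the final "r" of "kitapr" because the remaining stem is in the lexicon, and it keeps the final "r" of "ghar" because that letter is more plausibly part of the root. In a real system the transition and emission probabilities would be estimated from annotated data instead of being set by hand.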


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saharia, N., Konwar, K.M., Sharma, U., Kalita, J.K. (2013). An Improved Stemming Approach Using HMM for a Highly Inflectional Language. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_14

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science (R0)
