Abstract
Stemming is a process that maps the morphological variants of a word to its root word. It is extensively used as a pre-processing tool in natural language processing, information retrieval, and language modeling. Although many advances have been made in the field, a systematic organization of the previous work is still lacking. In this paper, we present a review of text stemming theory, algorithms, and applications. The paper first surveys the existing literature relevant to text stemming by classifying it according to certain key parameters; it then presents an in-depth analysis of some well-known stemming algorithms on standard data sets. Finally, it presents the current state of the art and certain open issues related to unsupervised stemming. The main aim of this paper is to provide an extensive and useful understanding of the important aspects of text stemming. The open issues and the analysis of current stemming techniques should help researchers identify new lines of research.
Cite this article
Singh, J., Gupta, V. A systematic review of text stemming techniques. Artif Intell Rev 48, 157–217 (2017). https://doi.org/10.1007/s10462-016-9498-2
DOI: https://doi.org/10.1007/s10462-016-9498-2