
A systematic review of text stemming techniques

Artificial Intelligence Review

Abstract

Stemming is the process of mapping the morphological variants of a word to its root word. It is used extensively as a pre-processing step in natural language processing, information retrieval, and language modeling. Although the field has advanced considerably, an organized account of previous work is still lacking. In this paper, we present a review of text stemming theory, algorithms, and applications. We first survey the existing literature on text stemming, classifying it according to certain key parameters, and then analyze in depth some well-known stemming algorithms on standard data sets. Finally, we present the current state of the art and certain open issues related to unsupervised stemming. The main aim of this paper is to provide an extensive and useful understanding of the important aspects of text stemming. The open issues and the analysis of current stemming techniques should help researchers identify new lines of future research.
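To make the idea of conflating morphological variants concrete, the following is a minimal sketch of a rule-based suffix stripper in Python. It is not any of the algorithms reviewed in the paper: the suffix list, the `min_stem_len` threshold, and the function names `toy_stem` and `conflate` are all illustrative assumptions.

```python
# Toy suffix-stripping stemmer (illustrative only; the suffix list and the
# minimum stem length are assumptions, not taken from any published stemmer).
SUFFIXES = ("ations", "ation", "ions", "ion", "ings", "ing",
            "ers", "er", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    # Try longer suffixes first so "ions" wins over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Group words that share the same stem."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    print(conflate(["connect", "connected", "connecting", "connections"]))
```

In this sketch the four variants of "connect" fall into one group, which is the behavior a retrieval system relies on. The same rules also reduce "news" to "new", a small instance of the over-stemming errors that a purely rule-based approach can produce.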




Correspondence to Vishal Gupta.


Cite this article

Singh, J., Gupta, V. A systematic review of text stemming techniques. Artif Intell Rev 48, 157–217 (2017). https://doi.org/10.1007/s10462-016-9498-2


  • DOI: https://doi.org/10.1007/s10462-016-9498-2
