A Systematic Review of Stemmers of Indian and Non-Indian Vernacular Languages

Authors:

Mayuri A. Mehta,

Ketan Kotecha

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 1

Article No.: 18, Pages 1–51

https://doi.org/10.1145/3604612

Published: 15 January 2024

Abstract

Stemming is a crucial pre-processing step in natural language processing. A stemmer carries out this step by reducing the morphological variants of a word to their common root or base form. Over the years, several stemmers for various vernacular languages have been proposed. However, very few studies have comprehensively investigated these stemmers. This article makes several contributions. First, we discuss the stemmers of 15 Indian and 17 non-Indian languages, describing their key points, benefits, and drawbacks. All the Indian languages for which stemmers have been built are covered in this study; for the non-Indian languages, stemmers of commonly spoken languages are covered. Second, we present a language-wise comparative analysis of stemmers based on our identified parameters. Third, we discuss the wordnets and dictionaries available for different languages. Fourth, we provide details of the datasets available for various languages. Fifth, we outline challenges in existing stemmers and directions for future research. The study reveals that significant research has been carried out on stemmers for influential languages such as English, Arabic, and Urdu. On the other hand, languages with limited resources, such as Farsi, Polish, Odia, Amharic, and others, have received the least research attention. Moreover, rigorous analysis reveals that most of the stemmers suffer from over-stemming errors. With a complete catalogue of available stemmers, this study aims to assist researchers and professionals working in areas such as information retrieval, semantic annotation, word meaning disambiguation, and ontology learning.
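
To make the notions of stemming and over-stemming concrete, the following is a minimal sketch of a suffix-stripping stemmer. It assumes a hypothetical four-suffix rule list for English and a hypothetical minimum stem length; the names `SUFFIXES`, `toy_stem`, and `min_stem_len` are illustrative and do not correspond to any stemmer surveyed in the article.

```python
# Toy suffix-stripping stemmer (illustrative assumption, not from the article).
# SUFFIXES is a hypothetical rule list, ordered so longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# Morphological variants of one word conflate to a single stem.
variants = [toy_stem(w) for w in ("walks", "walked", "walking")]
print(variants)

# Over-stemming: "news" and "new" differ in meaning but receive the same stem.
print(toy_stem("news"), toy_stem("new"))
```

The last line shows the failure mode the abstract highlights: a rule that is correct for "walks" also strips the final "s" of "news", merging two unrelated words. Rule-based stemmers commonly mitigate this with exception lists or dictionary lookups, at the cost of extra language-specific resources.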

References

[1]

P. Lahoti, N. Mittal, and G. Singh. 2022. A survey on NLP resources, tools, and techniques for marathi language processing. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 2 (2022), 1–34.

Digital Library

[2]

J. Baxi, P. Patel, and B. Bhatt. 2015. Morphological analyzer for gujarati using paradigm based approach with knowledge based and statistical methods. In Proceedings of the 12th International Conference on Natural Language Processing.

[3]

M. Algarni, B. Martin, T. Bell, and K. Neshatian. 2014. Simple arabic stemmer. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. 1803–1806.

Digital Library

[4]

J. B. Lovins. 1968. Development of a stemming algorithm. Mech. Translat. & Comp. Linguistics 11, 1–2 (1968), 22–31.

[5]

J. Dawson. 1974. Suffix removal and word conflation. ALLC Bulletin 2, 3 (1974), 33–46.

[6]

M. F. Porter and others. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137.

[7]

D. P. Chris and others. 1990. Another stemmer. In Proceedings of the ACM SIGIR Forum 1990.

[8]

Y. Jaafar, D. Namly, K. Bouzoubaa, and A. Yousfi. 2017. Enhancing arabic stemming process using resources and benchmarking tools. Journal of King Saud University - Computer and Information Sciences 29, 2 (2017), 164–170.

Digital Library

[9]

A. Ramanathan and D. D. Rao. 2003. A lightweight stemmer for hindi. Proceedings of the EACL 2003 Workshop on Computational Linguistics for South Asian Languages. 43–48.

[10]

A. K. Pandey and T. J. Siddiqui. 2008. An unsupervised hindi stemmer with heuristic improvements. In Proceedings of the SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data. 99–105.

Digital Library

[11]

K. Suba, D. Jiandani, and P. Bhattacharyya. 2011. Hybrid inflectional stemmer and rule-based derivational stemmer for gujarati. In Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing.

[12]

P. Patel, K. Popat, and P. Bhattacharyya. 2010. Hybrid stemmer for gujarati. In Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing.

[13]

N. Desai and B. Dalwadi. 2016. An affix removal stemmer for gujarati text. In Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development.

[14]

A. Al-Omari and B. Abuata. 2010. Arabic light stemmer (ARS). Journal of Engineering Science and Technology 9, 6 (2010), 702–717.

[15]

C. D. Patel and J. M. Patel. 2016. Improving a lightweight stemmer for gujarati. International Journal of Information 6, 1/2 (2016), 135--142.

[16]

C. D. Patel and J. M. Patel. 2017. GUJSTER: A rule based stemmer using dictionary approach. In Proceedings of the 2017 International Conference on Inventive Communication and Computational Technologies.

[17]

H. B. Patil and A. S. Patil. 2020. A hybrid stemmer for the affix stacking language: Marathi. In Proceedings of the Computing in Engineering and Technology, Singapore.

[18]

H. B. Patil, N. T. Mhaske, and A. S. Patil. 2018. Design and development of a dictionary based stemmer for marathi language. Smart and Innovative Trends in Next Generation Computing Technologies: Third International Conference, NGCT 2017, Dehradun, India, October 30-31, 2017, Revised Selected Papers, Part I 3. 769–777.

[19]

P. Pandey, D. Amin, and S. Govilkar. 2016. Rule based stemmer using marathi wordnet for marathi language. International Journal of Advanced Research in Computer and Communication Engineering 5, 10 (2016), 278–282.

[20]

H. B. Patil and A. S. Patil. 2017. MarS: A rule-based stemmer for morphologically rich language marathi. In Proceedings of the 2017 International Conference on Computer, Communications and Electronics.

[21]

M. R. Mahmud, M. Afrin, M. A. Razzaque, E. Miller, and J. Iwashige. 2014. A rule based bengali stemmer. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics.

[22]

S. Sarkar and S. Bandyopadhyay. 2008. Design of a rule-based stemmer for natural language text in bengali. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages (2008), 65–72.

[23]

D. Ganguly, J. Leveling, and G. J. F. Jones. 2013. DCU@Morpheme extraction task of fire-2012: Rule-based stemmers for bengali and hindi. In Proceedings of the 4th and 5th Annual Meetings of the Forum for Information Retrieval Evaluation. 1--5.

Digital Library

[24]

V. A. Ramachandran and I. Krishnamurthi. 2012. An iterative stemmer for tamil language. In Proceedings of the Intelligent Information and Database Systems.

Digital Library

[25]

M. Thangarasu and R. Manavalan. 2013. Stemmers for tamil language: Performance analysis. International Journal of Computer Science and Engineering Technology 4, 7 (2013), 902–908.

[26]

R. Kansal, V. Goyal, and G. S. Lehal. 2012. Rule based urdu stemmer. In Proceedings of COLING 2012: Demonstration Papers. 267--276.

[27]

J. Ameta, N. Joshi, and I. Mathur. 2012. A lightweight stemmer for gujarati. arXiv:1210.5486. Retrieved from https://arxiv.org/abs/1210.5486

[28]

J. Sheth and B. Patel. 2014. Dhiya: A stemmer for morphological level analysis of gujarati language. In Proceedings of the 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques.

[29]

J. Sheth. 2017. Saaraansh: Gujarati text summarization system. International Journal of Computer Science and Information Technology & Security 7, 3 (2017), 46–53.

[30]

S. Khoja and R. Garside. 1999. Stemming Arabic text, Technical Report. Lancaster University, Computing Department, Lancaster, UK.

[31]

I. A. Al Kharashi and I. A. Al Sughaiyer. 2002. Rule merging in a rule-based arabic stemmer. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, Stroudsburg.

Digital Library

[32]

M. Porter, R. Boulton, and A. Macfarlane. 2013. The english (porter2) stemming algorithm (2006). Retrieved March 31, 2022 from http://snowball.tartarus.org/algorithms/english/stemmer.html

[33]

B. Abuata and A. Al-Omari. 2015. A rule-based stemmer for arabic gulf dialect. Journal of King Saud University - Computer and Information Sciences 27, 2 (2015), 104–112.

Digital Library

[34]

S. R. El-Beltagy and A. Rafea. 2011. An accuracy-enhanced light stemmer for arabic text. ACM Transactions on Speech and Language Processing 7, 2 (2011), 1--22.

Digital Library

[35]

E. T. Al-Shammari and J. Lin. 2008. Towards an error-free arabic stemming. In Proceedings of the Association for Computing Machinery. Napa Valley, California.

Digital Library

[36]

A. Mokhtaripour and S. Jahanpour. 2006. Introduction to a new farsi stemmer. In International Conference on Information and Knowledge Management, Proceedings. 826–827.

Digital Library

[37]

N. Alemayehu and P. Willett. 2002. Stemming of amharic words for information retrieval. Literary and Linguistic Computing 17, 1 (2002), 1–17.

[38]

M. Korzycki. 2012. A dictionary based stemming mechanism for polish. In Proceedings of the 9th International Workshop on Natural Language Processing and Cognitive Science, Wroc law, Poland.

[39]

A. Estahbanati, R. Javidan, and M. A. Dezfooli. 2011. Implementation of a new method for stemming in persian language. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics. 1--5.

Digital Library

[40]

A. Honrado, R. Leon, R. O'Donnel, and D. Sinclair. 2000. A word stemming algorithm for the spanish language. In Proceedings of the7th International Symposium on String Processing and Information Retrieval.

[41]

A. Paul, A. Dey, and B. S. Purkayastha. 2014. An affix removal stemmer for natural language text in nepali. International Journal of Computer Applications 91, 6 (2014), 1–4.

[42]

I. Shrestha and S. S. Dhakal. 2016. A new stemmer for nepali language. In Proceedings of the 2nd International Conference on Advances in Computing, Communication, and Automation.

[43]

R. V. Alvares, A. C. B. Garcia, and I. Ferraz. 2005. STEMBR: A stemming algorithm for the brazilian portuguese language. In Proceedings of the Portuguese Conference on Artificial Intelligence.

Digital Library

[44]

R. V. Alvares and A. C. B Garcia. Evaluating stemmers for the portuguese language. Language 1, 4.

[45]

S. Y. Tai, C. S. Ong, and N. A. Abullah. 2000. On designing an automated malaysian stemmer for the malay language. In Proceedings of the 5th International Workshop on on Information Retrieval with Asian Languages.

Digital Library

[46]

U. Prajitha, C. Sreejith and P. C. R. Raj. 2013. LALITHA: A light weight malayalam stemmer using suffix stripping method. In Proceedings of the International Conference on Control Communication and Computing.

[47]

A. P. S. Kumar, P. Premchand, and A. Govardhan. 2011. TelStem: An unsupervised telugu stemmer with heuristic improvements and normalized signatures. Journal of Computational Linguistics Research 2, 1 (2011), 13–23.

[48]

S. Seal and N. Joshi. 2019. Design of an inflectional rule-based assamese stemmer. International Journal of Innovative Technology and Exploring Engineering 8, 6 (2019), 1651–1655.

[49]

B. G. Patra, K. Debbarma, S. Debbarma, D. Das, A. Das, and S. Bandyopadhyay. 2012. A light weight stemmer in kokborok. In Proceedings of the 24th Conference on Computational Linguistics and Speech Processing.

[50]

K. Nongmeikapam, B. Salam, M. Romina, N. M. Chanu, and S. Bandyopadhyay. 2011. A light weight manipuri stemmer. In Proceedings of the National Conference on Indian Language, Computing.

[51]

D. P. Sethi. 2013. Design of lightweight stemmer for odia derivational suffixes. Int. Journal of Advanced Research in Computer and Communication Engineering 2, 12 (2013), 4594–4597.

[52]

M. R. Shah, H. Shaikh, J. A. Mahar, and S. A. Mahar. 2016. Sindhi stemmer for information retrieval system using rule-based stripping approach. Sindh University Research Journal-SURJ (Science Series) 48, 4 (2016), 891–897.

[53]

B. Nathani, N. Joshi, and G. N. Purohit. 2018. A rule based light weight inflectional stemmer for sindhi devanagari using affix stripping approach. In Proceedings of the 3rd International Conference and Workshops on Recent Advances and Innovations in Engineering.

[54]

B. Nathani, N. Joshi, and G. N. Purohit. 2020. Design and development of unsupervised stemmer for sindhi language. Procedia Computer Science 167, 1 (2020), 1920–1927.

Digital Library

[55]

B. Nathani, N. Joshi, and G. N. Purohit. 2020. Rule-based derivational stemmer for sindhi devanagari using suffix stripping approach. In Proceedings of the Smart Systems and IoT: Innovations in Computing. 227–235.

[56]

D. Kumar and P. Rana. 2010. Design and development of a stemmer for punjabi. International Journal of Computer Applications 11, 12 (2010), 18–23.

[57]

V. Gupta. 2010. Automatic stemming of words for punjabi language. In Proceedings of the Advances in Signal Processing and Intelligent Recognition Systems.

[58]

M. Ali, S. Khalid, and M. Saleemi. 2019. Comprehensive stemmer for morphologically rich urdu language. The International Arab Journal of Information Technology 16, 1 (2019), 138–147.

[59]

Q.-u.-A. Akram, A. Naseer, and S. Hussain. 2009. Assas-band, an affix-exception-list based urdu stemmer. In Proceedings of the 7th Workshop on Asian Language Resources.

Digital Library

[60]

K. Darwish. 2002. Building a shallow arabic morphological analyser in one day. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages.

Digital Library

[61]

A. A. Argaw and L. Asker. 2007. An {A}mharic stemmer: Reducing words to their citation forms. Proceedings of the 45th Annual Meeting of the Association for {C}omputational {L}inguistics. 104–110.

[62]

P. Koirala and A. Shakya. 2020. A nepali rule based stemmer and its performance on different NLP applications. arXiv: 2002.09901. Retrieved from https://arxiv.org/abs/2002.09901

[63]

K. A. Ebrahim and S. Saidhbi. 2022. Designing stemmer for afaraf text using rule based approach. In Proceedings of the Innovations in Computer Science and Engineering. 281–288.

[64]

F. Idi. 1999. Building a French Stemmer using a Dictionary of French Root Words. Master's Thesis. Computer Science and Information Technology Universiti Putra Malaysia.

[65]

H. Zahid, I. Sajid, T. Saba, A. S. Almazyad, and R. Amjad. 2017. Design and development of dictionary-based stemmer for the urdu language. Journal of Theoretical and Applied Information Technology 95, 15 (2017), 3560–3569.

[66]

A. Rahimi. 2015. A new hybrid stemming algorithm for persian. arXiv: 1507.03077. Retrieved from https://arxiv.org/abs/1507.03077

[67]

S. Estahbanati and R. Javidan. 2011. A new stemmer for farsi language. In Proceedings of the CSI International Symposium on Computer Science and Software Engineering.

[68]

D. Weiss. 2005. Stempelator: A hybrid stemmer for the Polish language. Research Report. Institute of Computing Science: Poznan University of Technology, Poland.

[69]

H. Taghi-Zadeh, M. H. Sadreddini, M. H. Diyanati, and A. H. Rasekh. 2017. A new hybrid stemming method for persian language. Digital Scholarship in the Humanities 32, 1 (2017), 209–221.

[70]

C. G. Figureola, R. Gomez, Angel F. Zazo Rodriguez, and J. L. Alonso Berrocal. 2001. Stemming in spanish: A first approach to its impact on information retrieval. In Results of the CLEF 2001 Cross-Language System Evaluation Campaign. Working Notes for the CLEF 2001 Workshop. Darmstadt, Germany.

[71]

C. Sitaula. 2013. A hybrid algorithm for stemming of nepali text. Scientific Research, Intelligent Information Management 5, 4 (2013), 136--139.

[72]

S. P. Meitei, B. S. Purkayastha, and H. Mamata Devi. 2015. Development of a manipuri stemmer: A hybrid approach. In Proceedings of the International Symposium on Advanced Computing and Communication.

[73]

A. Mateen, M. K. Malik, Z. Nawaz, H. M. Danish, and M. H. Siddiqui. 2017. A hybrid stemmer of punjabi shahmukhi script. International Journal of Computer Science and Network Security 17, 8 (2017), 90–97.

[74]

N. Swapna. 2019. Root based stemmer for telugu script. International Journal of Engineering and Advanced Technology 8, 6 (2019), 2565–2568.

[75]

M. N. Al-Kabi, S. A. Kazakzeh, B. M. Abu Ata, S. A. Al-Rababah, and I. M. Alsmadi. 2015. A novel root based arabic stemmer. Journal of King Saud University - Computer and Information Sciences 27, 2 (2015), 94–103.

Digital Library

[76]

A. A. Sharifloo and M. Shamsfard. 2008. A bottom up approach to persian stemming, In Proceedings of the 3rd International Joint Conference on Natural Language Processing: Volume-II.

[77]

P. M. Dhanya, A. Sreekumar, and M. Jathavedan. 2018. Vriksh: A tree based malayalam lemmatizer using suffix replacement dictionary. International Journal of Emerging Technologies in Engineering Research 6, 1 (2018), 31–42.

[78]

A. Debbarma, B. S. Purkayastha, and P. Bhattacharya. 2014. Stemmer for resource scarce language using string similarity measure. In Proceedings of the International Conference on Reliability Optimization and Information Technology.

[79]

N. Saharia, K. M. Konwar, U. Sharma, and J. K. Kalita. 2013. An improved stemming approach using HMM for a highly inflectional language. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics.

Digital Library

[80]

B. T. Dinçer and B. Karaoğlan. 2003. Stemming in agglutinative languages: A probabilistic stemmer for turkish. In Proceedings of the International Symposium on Computer and Information Sciences.

[81]

K. Pragisha and P. C. Reghuraj. 2013. STHREE: Stemmer for malayalam using three pass algorithm. In Proceedings of the International Conference on Control Communication and Computing.

[82]

P. Gupta and S. S. Jamwal. 2021. Designing and development of stemmer of dogri using unsupervised learning. In Proceedings of the Soft Computing for Intelligent Systems. 147–156.

[83]

E. K. Cilden. 2006. Stemming Turkish Words using Snowball, Ericsim adresi: http://img.eba.gov.tr/542/7b6/2ce/3d5/995/c04/9a5/b2b/041˜…, 2006

[84]

K. Taghva, R. Beckley, and M. Sadeh. 2005. A stemming algorithm for the farsi language. In Proceedings of the International Conference on Information Technology: Coding and Computing.

Digital Library

[85]

M. Adriani, J. Asian, B. Nazief, S. M. M. Tahaghoghi, and H. E. Williams. 2007. Stemming indonesian: A confix-stripping approach. ACM Transactions on Asian Language Information Processing 6, 4 (2007), 1–33.

Digital Library

[86]

G. Ntais. 2006. Development of a Stemmer for the Greek Language. Citeseer.

[87]

L. S. Larkey, L. Ballesteros, and M. E. Connell. 2007. Light stemming for arabic information retrieval. In Proceedings of the Arabic Computational Morphology. 221–243.

[88]

I. Boukhalfa, S. Mostefai, and N. Chekkai. 2018. A study of graph based stemmer in arabic extrinsic plagiarism detection. In Proceedings of the 2nd Mediterranean Conference on Pattern Recognition and Artificial Intelligence. 27–32.

Digital Library

[89]

T. Kanan, B. Hawashin, S. Alzubi, E. Almaita, A. Alkhatib, K. A. Maria, and M. Elbes. 2022. Improving arabic text classification using p-stemmer. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science 15, 3 (2022), 404–411.

[90]

D. H. Abd, W. Khan, K. A. Thamer, and A. J. Hussain. 2021. Arabic light stemmer based on ISRI stemmer. In Proceedings of the International Conference on Intelligent Computing.

Digital Library

[91]

H. Alshalabi, S. Tiun, N. Omar, E. A. Anaam, and Y. Saif. 2022. BPR algorithm: New broken plural rules for an arabic stemmer. Egyptian Informatics Journal 2, 3 (2022), 363–371.

[92]

H. Alshalabi, S. Tiun, N. Omar, F. N. AL-Aswadi, and K. L. Alezabi. 2022. Arabic light-based stemmer using new rules. Journal of King Saud University-Computer and Information Sciences 34, 9 (2022), 6635–6642.

Digital Library

[93]

M. K. Saad and W. M. Ashour. 2010. Arabic morphological tools for text mining. In 6th International Conference on Electrical and Computer Systems (EECS'10).

[94]

S. Mammadov, S. Rustamov, A. Mustafali, Z. Sadigov, R. Mollayev, and Z. Mammadov. 2018. Part-of-speech tagging for azerbaijani language. In Proceedings of the 2018 IEEE 12th International Conference on Application of Information and Communication Technologies.

[95]

A. Ismailov, M. M. Abdul Jalil, Z. Abdullah, and N. H. Abd Rahim. 2016. A comparative study of stemming algorithms for use with the Uzbek language. In Proceedings of the 2016 3rd International Conference on Computer and Information Sciences. 7–12.

[96]

M. M. Jalil, A. Ismailov, N. H. Abd Rahim, and Z. Abdullah. 2017. The development of the uzbek stemming algorithm. Advanced Science Letters 23, 5 (2017), 4171–4174.

[97]

M. Akasereh and J. Savoy. 2012. Retrieval effectiveness study with farsi language. In Proceedings of Conference in Information Research and Applications (CORIA'12). 25--40.

[98]

S. Estahbanati, R. Javidan, and M. Nikkhah. 2011. A new multi-phase algorithm for stemming in farsi language based on morphology. International Journal of Computer Theory and Engineering 3, 5 (2011), 623–627.

[99]

A. A. Argaw and L. Asker. 2007. An amharic stemmer: Reducing words to their citation forms. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources.

[100]

N. N. Karanikolas. 2016. Building stemmers for the polish language. In Proceedings of the 20th Pan-Hellenic Conference on Informatics. 1--4.

Digital Library

[101]

M. Klubinski. 2011. Dictionary Stemmer for Polish Language. Ph.D. Dissertation. Instytut Informatyki, Poland.

[102]

G. Ntais, S. Saroukos, E. Berki, and H. Dalianis. 2016. Development and enhancement of a stemmer for the greek language. In Proceedings of the 20th Pan-Hellenic Conference on Informatics. 1--4.

Digital Library

[103]

S. Saroukos. 2009. Enhancing a greek language stemmer-efficiency and accuracy improvements. Master's Theis. Department of Computer Sciences, University of Tampere.

[104]

A. Rashidi and M. Z. Lighvan. 2014. HPS: A hierarchical Persian stemming method. International Journal on Natural Language Computing 3, 1 (2014), 11–20

[105]

A. Fernández, J. Díaz, Y. Gutiérrez, and R. Muñoz, 2011. An unsupervised method to improve spanish stemmer. In Proceedings of the International Conference on Application of Natural Language to Information Systems.

[106]

J. Savoy. 1993. Stemming of french words based on grammatical categories. Journal of the American Society for Information Science 44, 1 (1993), 1–9

[107]

J. Savoy. 1999. A stemming procedure and stopword list for general french corpora. Journal of the American Society for Information Science 50, 10 (1999), 944–952.

Digital Library

[108]

J. Savoy. 2006. Light stemming approach for the french, portuguese, german and hungarian languages. In Proceedings of the ACM Symposium on Applied Computing. 1031–1035.

[109]

P. Majumder, M. Mitra, and K. Datta. 2006. Statistical vs. rule-based stemming for monolingual french retrieval. In Proceedings of the Workshop of the Cross-Language Evaluation Forum for European Languages.

[110]

F. C. Ekmekçioglu and P. Willett. 2000. Effectiveness of stemming for turkish text retrieval. PROGRAM-LONDON-ASLIB 34, 2 (2000), 195–200.

[111]

B. K. Bal and P. Shrestha. 2004. A morphological analyzer and a stemmer for nepali. PAN Localization, Working Papers 2007, 1 (2004), 324–331.

[112]

S. Ahmadi. 2020. KLPT–kurdish language processing toolkit. In Proceedings of the 2nd Workshop for NLP Open Source Software.

[113]

K. S. Esmaili, S. Salavati, and A. Datta. 2014. Towards kurdish information retrieval. ACM Transactions on Asian Language Information Processing 13, 2 (2014), 1–18.

Digital Library

[114]

A. M. Mustafa and T. A. Rashid. 2018. Kurdish stemmer pre-processing steps for improving information retrieval. Journal of Information Science 44, 1 (2018), 15–27.

Digital Library

[115]

S. Salavati, K. S. Esmaili, and F. Akhlaghian. 2013. Stemming for Kurdish information retrieval. In Proceedings of the Asia Information Retrieval Symposium.

[116]

P. Nakov. 2003. Building an inflectional stemmer for bulgarian. In Proceedings of the 4th International Conference Conference on Computer Systems and Technologies: E-Learning. 419--424.

Digital Library

[117]

P. Nakov. 2003. BulStem: Design and evaluation of inflectional stemmer for bulgarian. In Proceedings of the Workshop on Balkan Language Resources and Tools (Balkan Conference in Informatics).

[118]

R. Khoury and F. Sapsford. 2016. Latin word stemming using wiktionary. Digital Scholarship in the Humanities 31, 2 (2016), 368–373.

[119]

R. Schinke, M. Greengrass, A. M. Robertson, and P. Willett. 1996. A stemming algorithm for latin text databases. Journal of Documentation 52, 2 (1996), 172--187.

[120]

P. Majumder, M. Mitra, and D. Pal. 2007. Hungarian and czech stemming using YASS. In Proceedings of the CLEF (Working Notes).

[121]

L. Dolamic and J. Savoy. 2009. Indexing and stemming approaches for the czech language. Information Processing and Management 45, 6 (2009), 714–720.

Digital Library

[122]

M. Braschler and B. Ripplinger. 2004. How effective is stemming and decompounding for german text retrieval?. Information Retrieval 7, 3 (2004), 291–316.

Digital Library

[123]

L. Weissweiler and A. Fraser. 2017. Developing a stemmer for german based on a comparative analysis of publicly available stemmers. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology.

[124]

A. A. Suryani, D. H. Widyantoro, A. Purwarianti, and Y. Sudaryat. 2018. The rule-based sundanese stemmer. ACM Transactions on Asian and Low-Resource Language Information Processing 17, 4 (2018), 1–28.

Digital Library

[125]

J. Samuel and S. Teferra. 2018. Designing a rule based stemming algorithm for kambaata language text. International Journal of Computational Linguistics 9, 2 (2018), 41–54.

[126]

L. S. Indradjaja and S. Bressan. 2003. Automatic learning of stemming rules for the indonesian language. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation.

[127]

D. S. Maylawati, W. B. Zulfikar, C. Slamet, M. A. Ramdhani, and Y. A. Gerhana. 2018. An improved of stemming algorithm for mining indonesian text with slang on social media. In Proceedings of the 6th International Conference on Cyber and IT Service Management.

[128]

A. S. Rizki, A. Tjahyanto, and R. Trialih. 2019. Comparison of stemming algorithms on indonesian text processing. Telkomnika 17, 1 (2019), 95–102.

[129]

R. Setiawan, A. Kurniawan, W. Budiharto, I. H. Kartowisastro, and H. Prabowo. 2016. Flexible affix classification for stemming indonesian language. In Proceedings of the 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology.

[130]

K. Chauhan, R. Patel, and H. Joshi. 2013. Towards improvement in gujarati text information retrieval by using effective gujarati stemmer. Journal of Information, Knowledge and Research in Computer Engineering 2, 2 (2013), 499–599.

[131]

N. Aswani and R. J. Gaizauskas. 2010. Developing morphological analysers for south asian languages: Experimenting with the hindi and gujarati languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). 811--815.

[132]

J. Sheth and B. Patel. 2012. Stemming techniques and naïve approach for gujarati stemmer. International Journal of Computer Applications 1, 2 (2012), 9--11.

[133]

C. K. Bhensdadia, B. Bhatt, and P. Bhattacharyya. 2010. Introduction to gujarati wordnet. In Proceedings of Third National Workshop on IndoWordNet. 1–5.

[134]

P. Panchal, N. Panchal, H. Samani, A. Complex, and E. Nagar. 2014. Development of gujarati wordnet for family of words. Int. Res. J. Comput. Sci 1, 4 (2014), 28–32.

[135]

U. Chauhan and A. Shah. 2021. Topic modeling using latent dirichlet allocation: A survey. ACM Computing Surveys 54, 7 (2021), 1–35.

Digital Library

[136]

U. Chauhan and A. Shah. 2021. Improving semantic coherence of gujarati text topic model using inflectional forms reduction and single-letter words removal. ACM Transactions on Asian and Low-Resource Language Information Processing 20, 1 (2021), 1–18.

Digital Library

[137]

S. B. Rodzman, M. F. I. A. Ronie, N. K. Ismail, N. A. Rahman, F. Ahmad, and Z. M. Nor. 2018. Analyzing malay stemmer performance towards fuzzy logic ranking function on malay text corpus. Proceedings of the 2018 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences. 36–41.

[138]

S. Sulaiman, K. Omar, N. Omar, M. Z. Murah, and H. A. Rahman. 2014. The effectiveness of a Jawi stemmer for retrieving relevant malay documents in jawi characters. ACM Transactions on Asian Language Information Processing 13, 2 (2014), 1–21.

Digital Library

[139]

A. V. Krishna. Malayalam Stemmer.

[140]

P. V. Kadam, B. K. Khandale, and C. N. Mahender. 2022. Design and development of marathi word stemmer. In Proceedings of the 2nd International Conference on Advances in Computer Engineering and Communication Systems. 35–48.

[141]

V. Giri, M. M. Math, and U. P. Kulkarni. 2021. MTStemmer: A multilevel stemmer for effective word pre-processing in marathi. Turkish Journal of Computer and Mathematics Education 12, 2 (2021), 1885–1894.

[142]

R. S. Patil and S. R. Kolhe. 2022. Inflectional and derivational hybrid stemmer for sentiment analysis: A case study with marathi tweets. In Proceedings of the International Conference on Recent Trends in Image Processing and Pattern Recognition.

[143]

N. S. Dash. 2004. Morphological processing of words in bangla corpus. Indian Journal of Applied Linguistics 30, 2 (2004), 63–83.

[144]

T. Ahmed, S. Hossain, M. S. Salim, A. Anjum, and K. M. A. Hasan. 2021. Gold dataset for the evaluation of bangla stemmer. In Proceedings of the 5th International Conference on Electrical Information and Communication Technology.

[145]

P. Mythilisharan, P. Laxminarayana, and A. Venkataramana. 2019. Unsupervised stemming based language model for telugu broadcast news transcription. arXiv:1908.03734. Retrieved from https://arxiv.org/abs/1908.03734

[146]

S. Bhat. 2013. Statistical stemming for kannada. In Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing.

[147]

N. Deepamala and P. R. Kumar. 2015. Kannada stemmer and its effect on Kannada documents classification. In Computational Intelligence in Data Mining - vol. 3: Proceedings of the International Conference on CIDM, 20--21.

[148]

G. Trishala and H. R. Mamatha. 2021. Implementation of stemmer and lemmatizer for a low-resource language—kannada. In Proceedings of the International Conference on Intelligent Computing, Information and Control Systems.

[149]

M. C. Padma and R. J. Prathibha. 2014. Development of morphological stemmer, analyzer and generator for Kannada nouns. In Proceedings of the Emerging Research in Electronics, Computer Science and Technology.

[150]

J. Sarmah, S. K. Sarma, and A. K. Barman. 2012. Development of assamese rule based stemmer using wordnet. In Proceedings of the 10th Global WordNet Conference.

[151]

A. Debbarma. 2012. Kokborok morphological analyzer using stemmer. In International Journal of Computer Applications 1, 9 (2012), 29--31.

[152]

S. Chaupattnaik, S. S. Nanda, and S. Mohanty. 2012. A suffix stripping algorithm for odia stemmer. International Journal of Computational Linguistics and Natural Language Processing 1, 1 (2012), 1–5.

[153]

A. A. Sattar, S. Abbasi, M. U. Rahman, A. Baig, and M. Nizamani. 2021. Sindhi stemmer using affix removal method. International Journal of Advanced Trends in Computer Science and Engineering 10, 3 (2021), 2447–2451.

[154]

H. Singh. 2021. Analyzing the punjabi language stemmers: A critical approach. In Proceedings of the International Semantic Intelligence Conference.

[155]

H. Singh. 2022. GPStemmer—a gurmukhi punjabi stemmer. In Proceedings of the Advances in Data and Information Sciences. 493–503.

[156]

N. Thabet. 2004. Stemming the qur'an. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages.

[157]

P. Baker, A. Hardie, T. McEnery, H. Cunningham, and R. J. Gaizauskas. 2002. EMILLE, A 67-million word corpus of indic languages: Data collection, mark-up and harmonisation. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02).

[158]

J. Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 2 (2001), 153–198.

[159]

A. Boradia. 2018. A study of different methods & techniques for stemming in gujarati text mining, 8, 11 (2018), 2178–2187.

[160]

TDIL. 2022. Indian Language Technology Proliferation and Deployment Centre. Retrieved from https://tdil-dc.in/index.php?lang=en

[161]

C. Patel and K. Gali. 2008. Part-of-speech tagging for gujarati using conditional random fields. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages. 117–122.

[162]

M. Patel and P. Balani. 2013. Clustering algorithm for gujarati language. arXiv:1307.5393. Retrieved from https://arxiv.org/abs/1307.5393

[163]

Bhagwadgomandal. 2020. Digital Bhagwadgomandal, Powered by Gujarati Lexicon. Retrieved from http://www.bhagavadgomandal.com/index.php

[164]

Gujarati Lexicon. 2022. Gujarati Lexicon: World's Most Comprehensive Gujarati Language Resources. Retrieved from https://www.gujaratilexicon.com/

[165]

S. S. Panchal, P. P. Shukla, P. R. Kolte, J. S. Kolte, and H. N. Bharathi. 2015. Gujarati wordnet – a lexical database. International Journal of Computer Applications 116, 20 (2015), 6–8.

[166]

U. Kapadia and A. Desai. 2015. Morphological rule set and lexicon of gujarati grammar: A linguistics approach. VNSGU Journal of Science and Technology 4, 1 (2015), 127–133.

[167]

D. A. Kothari. 2010. Practical Gujarati Grammar (2nd. ed.). Arunoday Publication, Ahmedabad.

[168]

S. Brock. 1973. A Brief Outline of Syriac Literature. Gorgias Press.

[169]

J. Baxi and B. Bhatt. 2022. GujMORPH-a dataset for creating gujarati morphological analyzer. In Proceedings of the 13th Language Resources and Evaluation Conference.

[170]

A. G. Jivani. 2011. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl 2, 6 (2011), 1930–1938.

[171]

S. R. Sirsat, V. Chavan, and H. S. Mahalle. 2013. Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies 4, 2 (2013), 265–269.

[172]

I. Bensalem, I. Boukhalfa, P. Rosso, L. Abouenour, K. Darwish, and S. Chikhi. 2015. Overview of the AraPlagDet PAN@FIRE2015 shared task on arabic plagiarism detection. JAPCA 1587 (2015), 111–122.

[173]

K. Dukes and N. Habash. 2010. Morphological annotation of quranic arabic. In Proceedings of the 7th International Conference on Language Resources and Evaluation, Valletta.

[174]

S. Ghwanmeh. 2012. Enhanced algorithm for extracting the root of arabic words. In Proceedings of the 6th International Conference on Computer Graphics, Imaging and Visualization.

[175]

T. Zerrouki. 2022. Tashaphyne: Arabic Light Stemmer, (Jan 2022). Retrieved March 31, 2022 from https://pypi.org/project/Tashaphyne/

[176]

G. N. Alemneh. 2020. Amharic light stemmer. In Proceedings of the International Conference on Learning Representations (ICLR).

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 23, Issue 1

January 2024

385 pages

EISSN:2375-4702

DOI:10.1145/3613498

Editor:
Imed Zitoun
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 January 2024

Online AM: 14 June 2023

Accepted: 29 May 2023

Revised: 30 January 2023

Received: 01 August 2022
