Abstract
Stemming is a common word conflation method that identifies the stems embedded in words and reduces each word to its stem (root), conflating all morphologically related terms into a single term without performing a complete morphological analysis. This article presents STEMUR, an enhanced stemming algorithm for automatic word conflation in the Urdu language. In addition to handling words with prefixes and suffixes, STEMUR also handles words with infixes. Rather than using a fully unsupervised approach, we used linguistic knowledge to develop a collection of patterns for Urdu infixes, which improves the accuracy of the stems and affixes acquired during the training process. STEMUR also handles English loan words and words with more than one affix. STEMUR is compared with four existing Urdu stemmers, including Assas-Band and the template-based stemmer, which are also implemented in this study. Results are computed on two corpora containing 89,437 and 30,907 words, respectively. The results show clear improvements in the strength and accuracy of STEMUR. Using the maximum possible number of infix rules raised the stemmer's accuracy up to 93.1% and helped achieve a precision of 98.9%.
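To make the distinction between prefix, suffix, and infix handling concrete, the following is a minimal sketch of a rule-based stemmer. It is not STEMUR's rule set or algorithm: the affix lists, the single infix pattern, the `MIN_STEM_LEN` threshold, and the `stem` function are all hypothetical, and the words are romanized for readability, whereas a real Urdu stemmer would operate on Urdu script.

```python
import re

# Toy rule lists, chosen only for illustration (assumptions, not STEMUR's rules).
PREFIXES = ["bad", "be"]   # e.g. bad-naam, be-naam
SUFFIXES = ["ein", "on"]   # e.g. kitab-ein, kitab-on

# One hypothetical infix pattern: an Arabic-style broken plural of the shape
# ma-C1-a-C2-i-C3 (masajid) is mapped back to ma-C1-C2-i-C3 (masjid).
INFIX_PATTERNS = [
    (re.compile(r"^ma(\w)a(\w)i(\w)$"), r"ma\1\2i\3"),
]

# Assumed guard: never strip an affix if fewer than this many letters remain.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Return a stem using infix patterns first, then one prefix and one suffix."""
    # 1. Infix rules: a whole-word pattern match rewrites the word directly.
    for pattern, replacement in INFIX_PATTERNS:
        if pattern.match(word):
            return pattern.sub(replacement, word)

    # 2. Strip at most one prefix, trying longer prefixes first.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break

    # 3. Strip at most one suffix, trying longer suffixes first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break

    return word


if __name__ == "__main__":
    for w in ["kitabein", "badnaam", "benaamon", "masajid"]:
        print(w, "->", stem(w))
```

In this sketch, a word carrying both a prefix and a suffix loses both in a single pass, and the minimum-length guard is one simple way to avoid over-stemming short words. A pattern-driven system with learned affixes, as the abstract describes, would need far richer rules and exception handling.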