research-article

STEMUR: An Automated Word Conflation Algorithm for the Urdu Language

Published: 09 November 2021

Abstract

Stemming is a common word conflation method that identifies the stem embedded in a word and reduces the word to that stem (root), conflating all morphologically related terms into a single term without performing a complete morphological analysis. This article presents STEMUR, an enhanced stemming algorithm for automatic word conflation for the Urdu language. In addition to handling words with prefixes and suffixes, STEMUR also handles words with infixes. Rather than using a totally unsupervised approach, we used linguistic knowledge to develop a collection of patterns for Urdu infixes, which improves the accuracy of the stems and affixes acquired during the training process. STEMUR also handles English loan words and words with more than one affix. STEMUR is compared with four existing Urdu stemmers, including Assas-Band and the template-based stemmer, which are also implemented in this study. Results are computed on two corpora containing 89,437 and 30,907 words, respectively. They show clear improvements in the strength and accuracy of STEMUR. Using the maximum possible number of infix rules boosted our stemmer's accuracy up to 93.1% and helped us achieve a precision of 98.9%.
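
To make the idea of handling prefixes, suffixes, and infixes concrete, the following is a minimal sketch of rule-based affix stripping followed by one infix pattern. It is not STEMUR itself: the rule lists, the minimum stem length, and the names `PREFIXES`, `SUFFIXES`, `INFIX_PATTERNS`, `strip_affix`, and `stem` are all hypothetical and chosen only for illustration, since the abstract does not list the actual rules.

```python
import re

# Hypothetical toy rules; the real STEMUR affix lists and infix
# patterns are not given in the abstract.
ALIF = "\u0627"                # the letter alif
PREFIXES = ["بد"]              # e.g. "bad-" as in badnaam
SUFFIXES = ["وں", "ی"]         # e.g. oblique plural, abstract-noun ending
# One assumed infix pattern: a four-letter word whose second letter is
# alif (the faa'il template) is reduced to its three remaining letters.
INFIX_PATTERNS = [(re.compile(f"^(.){ALIF}(.)(.)$"), r"\1\2\3")]
MIN_STEM_LEN = 3               # assumed guard against over-stemming


def strip_affix(word, affixes, at_start):
    """Remove the longest matching affix if enough letters remain."""
    for affix in sorted(affixes, key=len, reverse=True):
        found = word.startswith(affix) if at_start else word.endswith(affix)
        if found and len(word) - len(affix) >= MIN_STEM_LEN:
            return word[len(affix):] if at_start else word[:-len(affix)]
    return word


def stem(word):
    """Strip one prefix, then one suffix, then apply an infix pattern."""
    word = strip_affix(word, PREFIXES, at_start=True)
    word = strip_affix(word, SUFFIXES, at_start=False)
    for pattern, replacement in INFIX_PATTERNS:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word
```

Under these toy rules, `stem("کتابوں")` strips the plural suffix, `stem("بدنامی")` strips both a prefix and a suffix, and `stem("عالم")` is handled only by the infix pattern. Applying the infix pattern last is a design choice of this sketch, so that a template is matched against the word only after outer affixes are gone.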

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
      March 2022
      413 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3494070

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 November 2021
      • Accepted: 1 July 2021
      • Revised: 1 March 2021
      • Received: 1 January 2021
      Qualifiers

      • research-article
      • Refereed