STEMUR: An Automated Word Conflation Algorithm for the Urdu Language

Authors:
Tayyaba Fatima
Raees Ul Islam
Muhammad Waqas Anwar
M. Hasan Jamal
M. Tayyab Chaudhry
Zeeshan Gillani

All authors: COMSATS University Islamabad, Lahore, Pakistan

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2, Article No. 35, pp. 1–20
https://doi.org/10.1145/3476226

Published: 09 November 2021

Abstract

Stemming is a common word conflation method that identifies the stem embedded in a word and reduces the word to that stem (root), conflating all morphologically related terms into a single term without performing a complete morphological analysis. This article presents STEMUR, an enhanced stemming algorithm for automatic word conflation in the Urdu language. In addition to handling words with prefixes and suffixes, STEMUR handles words with infixes. Rather than taking a fully unsupervised approach, we used linguistic knowledge to develop a collection of patterns for Urdu infixes, which improves the accuracy of the stems and affixes acquired during training. STEMUR also handles English loan words and words with more than one affix. STEMUR is compared with four existing Urdu stemmers, including Assas-Band and the template-based stemmer, which were also implemented for this study. Results are reported on two corpora containing 89,437 and 30,907 words, respectively. They show clear improvements in the strength and accuracy of STEMUR. Using the maximum possible number of infix rules raised our stemmer's accuracy up to 93.1% and helped us achieve a precision of 98.9%.
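As a rough illustration of affix stripping and of what stemmer "strength" can mean, the Python sketch below removes at most one prefix and one suffix from a word and then computes two commonly used strength measures: mean words per conflation class and index compression factor. The affix lists, the minimum stem length, the sample words, and the function names are assumptions made for this example. They are not STEMUR's rules, infix handling is omitted entirely, and the abstract does not state which strength measures the article reports.

```python
# Illustrative affix lists only (hypothetical, not STEMUR's rule set).
PREFIXES = ("بد", "بے")
SUFFIXES = ("وں", "یں")
MIN_STEM_LEN = 2  # assumed guard against stripping very short words


def strip_affixes(word):
    """Remove at most one listed prefix and one listed suffix."""
    # Try longer affixes first so a short affix does not shadow a long one.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def strength_metrics(words):
    """Return mean words per conflation class and index compression factor."""
    unique_words = set(words)
    stems = {strip_affixes(w) for w in unique_words}
    n, s = len(unique_words), len(stems)
    return {
        "mean_words_per_class": n / s,   # higher means heavier conflation
        "index_compression": (n - s) / n,  # fraction of index entries removed
    }


sample = ["کتاب", "کتابیں", "کتابوں", "صورت", "بدصورت"]
metrics = strength_metrics(sample)
print(strip_affixes("کتابیں"), metrics)
```

On this toy sample the five distinct words collapse into two stems. A stronger stemmer merges more words per stem, which shrinks an index but also raises the risk of conflating unrelated words, so strength is usually read alongside accuracy rather than on its own.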

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 2
March 2022
413 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3494070
Editor:
Imed Zitouni
Google, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 November 2021
- Accepted: 1 July 2021
- Revised: 1 March 2021
- Received: 1 January 2021
Author Tags
Conflation
index processing
prefix
suffix and infix stemming
Urdu language processing
Qualifiers
- research-article
- Refereed
