Assessing Urdu Language Processing Tools via Statistical and Outlier Detection Methods on Urdu Tweets

Authors: Zoya, Seemab Latif, Rabia Latif, Hammad Majeed, Nor Shahida Mohd Jamail

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 10

Article No.: 234, Pages 1 - 31

https://doi.org/10.1145/3622939

Published: 13 October 2023

Abstract

Text pre-processing is a crucial step in Natural Language Processing (NLP) applications, particularly for handling informal and noisy social media content. Word-level tokenization plays a vital role in pre-processing: it supports removing stop words, filtering irrelevant characters, and retaining relevant tokens. These tokens are essential for constructing meaningful n-grams within the advanced NLP frameworks used for data modeling. However, tokenization in low-resource languages such as Urdu is challenging because of language complexity and limited resources. Conventional space-based methods and the direct application of language-specific tools often produce erroneous tokens in Urdu Language Processing (ULP). This prevents language models from effectively learning language-specific and domain-specific tokens, leading to sub-optimal results on downstream tasks such as aspect mining, topic modeling, and Named Entity Recognition (NER). To address this issue for Urdu, we propose a data pre-processing technique that detects outliers using the Inter-Quartile Range (IQR) method, together with normalization algorithms that create useful lexicons in conjunction with existing technologies. We collected approximately 50 million Urdu tweets using the Twitter API and analyzed the performance of two existing tokenizers (Urduhack and a space-based tokenizer). We created dataset variants based on these tokenizers and used statistical tests and visualization techniques to compare tokenization results before and after applying the proposed outlier detection and normalization method. Our findings show a noticeable improvement in token size distributions and in the handling of informal, misspelled, and lengthy tokens. The Urduhack tokenizer combined with the proposed outlier detection and normalization yielded tokens with the best-fitted distribution in ULP. We evaluated its effectiveness on the task of topic modeling using Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). The results yielded new and distinct topics with unigram features and highly coherent topics with bigram features. For the traditional space-based method, the results consistently showed improved coherence and precision scores. Between the two models, NMF topic modeling with bigram features outperformed LDA topic modeling with bigram features.
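As a rough illustration of the IQR idea mentioned in the abstract, the following minimal Python sketch flags tokens whose character length falls outside the usual Tukey fences. It is not the authors' algorithm: the choice of token length as the measured feature, the multiplier `k = 1.5`, the sample tokens, and the helper names `iqr_bounds` and `split_by_length` are all assumptions made for this example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_by_length(tokens, k=1.5):
    """Split tokens into (kept, outliers) using IQR fences on token length."""
    lower, upper = iqr_bounds([len(t) for t in tokens], k)
    kept = [t for t in tokens if lower <= len(t) <= upper]
    outliers = [t for t in tokens if not (lower <= len(t) <= upper)]
    return kept, outliers


# Hypothetical sample: six ordinary tokens plus one elongated informal token,
# built by repeating a letter, as often happens in tweets.
elongated = "اچھا" + "ا" * 20
tokens = ["یہ", "ایک", "مثال", "ہے", "اور", "بہت", elongated]

kept, outliers = split_by_length(tokens)
```

In this sketch the elongated token lands in `outliers`, where a normalization step could then collapse or correct it, while the ordinary tokens stay in `kept`. Real tweet corpora would need the fences computed over the full token-length distribution rather than a single short list.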
