research-article

Assessing Urdu Language Processing Tools via Statistical and Outlier Detection Methods on Urdu Tweets

Published: 13 October 2023

Abstract

Text pre-processing is a crucial step in Natural Language Processing (NLP) applications, particularly for handling informal and noisy content on social media. Word-level tokenization plays a vital role in text pre-processing because it underpins removing stop words, filtering irrelevant characters, and retaining relevant tokens. These tokens are essential for constructing meaningful n-grams within the advanced NLP frameworks used for data modeling. However, tokenization in low-resource languages like Urdu is challenging because of language complexity and limited resources. Conventional space-based methods and the direct application of language-specific tools often produce erroneous tokens in Urdu Language Processing (ULP). This hinders language models from effectively learning language-specific and domain-specific tokens, leading to sub-optimal results on downstream tasks such as aspect mining, topic modeling, and Named Entity Recognition (NER). To address this issue for Urdu, we propose a data pre-processing technique that detects outliers using the Inter-Quartile Range (IQR) method, together with normalization algorithms for creating useful lexicons in conjunction with existing technologies. We collected approximately 50 million Urdu tweets using the Twitter API and analyzed the performance of existing language-specific tokenizers (Urduhack and a space-based tokenizer). We created dataset variants based on these tokenizers and used statistical tests and visualization techniques to compare tokenization results before and after applying the proposed outlier detection and normalization method. Our findings show a noticeable improvement in token size distributions and in the handling of informal-language, misspelled, and lengthy tokens. The Urduhack tokenizer combined with the proposed outlier detection and normalization yielded tokens with the best-fitted distribution in ULP.
We evaluated its effectiveness through topic modeling using Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). The results showed new and distinct topics with unigram features and highly coherent topics with bigram features. For the traditional space-based method, the results consistently showed improved coherence and precision scores. With bigram features, NMF topic modeling outperformed LDA topic modeling.
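The IQR-based outlier detection named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the rule is applied to token character lengths, the conventional 1.5 × IQR fences are used, and the helper names `iqr_bounds` and `split_outliers` are hypothetical rather than taken from the article or from Urduhack.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    # method="inclusive" treats the data as the full population of observed values
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(tokens, k=1.5):
    """Split tokens into (kept, outliers) by character length (assumed criterion)."""
    lengths = [len(t) for t in tokens]
    lower, upper = iqr_bounds(lengths, k)
    kept = [t for t in tokens if lower <= len(t) <= upper]
    outliers = [t for t in tokens if not (lower <= len(t) <= upper)]
    return kept, outliers


# Toy tweet tokenized on whitespace; the last token imitates an elongated noisy token
tweet = "یہ ایک مثال ہے " + "ہاہا" * 10
tokens = tweet.split()
kept, outliers = split_outliers(tokens)
print(len(kept), "tokens kept;", len(outliers), "flagged as length outliers")
```

In this sketch the flagged tokens are only separated, not discarded, so a later normalization step (which the abstract mentions but does not specify) could still inspect or repair them.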

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 10
    October 2023
    226 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3627976

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 October 2023
    Online AM: 08 September 2023
    Accepted: 24 August 2023
    Revised: 20 August 2023
    Received: 23 May 2023
    Published in TALLIP Volume 22, Issue 10

    Author Tags

    1. Urdu language processing
    2. Urdu tweets
    3. Urdu text tokenization
    4. Urdu language processing tools
    5. outlier detection and removal

    Qualifiers

    • Research-article
