Abstract
In recent decades, hate speech on social media platforms has been on the rise. Controlling this kind of material is highly desirable because it incites unrest and harms society. The literature describes several forms of hate speech, and it is challenging to differentiate between these forms and to design an automated detection system, especially for under-resourced languages. In this study, we propose a robust framework for identifying threatening expressions and their targets in the Urdu language (Nastaliq script). The proposed methodology describes each step in detail: data collection and annotation, cleaning and pre-processing, and fine-tuning of a Robustly Optimized BERT Pretraining Approach model for Urdu (Urdu-RoBERTa), with grid search used for hyper-parameter optimization. The study thus exploits the strength of a pre-trained Urdu-RoBERTa as a transfer learning technique. The proposed framework is compared with a state-of-the-art baseline and ten comparable models, and it outperforms all of them on both tasks (threatening expression identification and target identification). Furthermore, the proposed framework achieves benchmark performance and improves the F1-score by a substantial margin.
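The grid search step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the search space or the training code, so everything named here is an assumption: the hyper-parameter names and values in `PARAM_GRID`, the helper `grid_search`, and especially `toy_score`, which is only a stand-in for "fine-tune the model with this configuration and return its validation F1-score".

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def grid_search(
    param_grid: Dict[str, List[float]],
    score_fn: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Exhaustively score every combination in param_grid; return the best."""
    names = list(param_grid)
    best_cfg: Config = {}
    best_score = float("-inf")
    # product() enumerates the Cartesian product of all candidate values.
    for values in product(*(param_grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_score(cfg: Config) -> float:
    """Hypothetical stand-in for a fine-tuning run that returns validation F1.

    It simply peaks at learning_rate=2e-5, batch_size=16, epochs=3 so that
    the search has a known optimum; it does not train any model.
    """
    return (
        -abs(cfg["learning_rate"] - 2e-5) * 1e4
        - abs(cfg["batch_size"] - 16) / 100
        - abs(cfg["epochs"] - 3) / 10
    )


# Assumed search space, chosen only for illustration.
PARAM_GRID: Dict[str, List[float]] = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [8, 16, 32],
    "epochs": [2, 3, 4],
}

best_cfg, best_score = grid_search(PARAM_GRID, toy_score)
```

In a real setting, `score_fn` would fine-tune the pre-trained model once per configuration, so the cost grows with the product of the list lengths (27 runs for the grid above).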
Acknowledgments
This article is an output of a research project implemented as part of the Basic Research Program at the National Research University Higher School of Economics (HSE University). Moreover, this research was supported in part by computational resources of HPC facilities at HSE University.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Malik, M.S.I. (2024). Threatening Expression and Target Identification in Under-Resource Languages Using NLP Techniques. In: Ignatov, D.I., et al. Analysis of Images, Social Networks and Texts. AIST 2023. Lecture Notes in Computer Science, vol 14486. Springer, Cham. https://doi.org/10.1007/978-3-031-54534-4_1
DOI: https://doi.org/10.1007/978-3-031-54534-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54533-7
Online ISBN: 978-3-031-54534-4
eBook Packages: Computer Science (R0)