Abstract
Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.


Similar content being viewed by others
Explore related subjects
Discover the latest articles and news from researchers in related subjects, suggested using machine learning.Availability of data and materials
The datasets and materials generated during the current study are available from the corresponding author on reasonable request.
Code availability
The custom code developed for this research is available from the corresponding author on reasonable request.
Notes
Chink: ethnic insult usually referring to a person of Chinese descent.
References
Acheampong FA, Nunoo-Mensah H, Chen W (2021) Transformer models for text-based emotion detection: a review of BERT-based approaches. Artif Intell Rev 54(8):5789–5829
Al-Hawari F, Barham H (2021) A machine learning based help desk system for it service management. J King Saud Univ Comput Inf Sci 33(6):702–718
Al-Saqqa S, Awajan A (2019) The use of word2vec model in sentiment analysis: A survey. In: Proceedings of the 2019 international conference on artificial intelligence, robotics and control, pp 39–43
Al Sharou K, Li Z, Specia L (2021) Towards a better understanding of noise in natural language processing. In: Proceedings of the International conference on recent advances in natural language processing (RANLP 2021), pp 53–62
Allahyari M, Pouriyeh S, Assefi M, et al (2017) A brief survey of text mining: classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919
Alsafari S, Sadaoui S (2021) Semi-supervised self-training of hate and offensive speech from social media. Appl Artif Intell 35(15):1621–1645
Alzubaidi L, Zhang J, Humaidi AJ et al (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8:1–74
Amini MR, Feofanov V, Pauletto L, et al (2022) Self-training: a survey. arXiv preprint arXiv:2202.12040
Barbieri F, Ballesteros M, Saggion H (2017) Are emojis predictable? arXiv preprint arXiv:1702.07285
Bashir I, Malik A, Mahmood K (2021) Social media use and information-sharing behaviour of university students. IFLA J 47(4):481–492
Benítez-Andrades JA, González-Jiménez Á, López-Brea Á et al (2022) Detecting racism and xenophobia using deep learning models on twitter data: CNN, LSTM and BERT. PeerJ Comput Sci 8:e906
Cataldo I, Lepri B, Neoh MJY et al (2021) Social media usage and development of psychiatric disorders in childhood and adolescence: a review. Front Psych 11:508595
Chai J, Li A (2019) Deep learning in natural language processing: a state-of-the-art survey. In: 2019 International Conference on Machine Learning and Cybernetics (ICMLC), IEEE, pp 1–6
Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. arxiv:1810.04805
Dhillon A, Verma GK (2020) Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell 9(2):85–112
Elias A (2021) The many forms of contemporary racism. Centre for Resilient and Inclusive Societies 6
Garg P, Pahuja S (2020) Social media: concept, role, categories, trends, social media and AI, impact on youth, careers, recommendations. In: Managing social media practices in the digital economy. IGI Global, pp 172–192
Grosfoguel R (2016) What is racism? J World-Syst Res 22(1):9–15
Gupta I, Joshi N (2021) Real-time twitter corpus labelling using automatic clustering approach. Int J Comput Digital Syst 10:519–532
Gutiérrez-Fandiño A, Armengol-Estapé J, Pàmies M, et al (2021) Maria: Spanish language models. arXiv preprint arXiv:2107.07253
Hayaty M, Muthmainah S, Ghufran SM (2020) Random and synthetic over-sampling approach to resolve data imbalance in classification. Int J Artif Intell Res 4(2):86–94
Hegazi MO, Al-Dossari Y, Al-Yahy A et al (2021) Preprocessing Arabic text on social media. Heliyon 7(2):e06191
Istaiteh O, Al-Omoush R, Tedmori S (2020) Racist and sexist hate speech detection: literature review. In: 2020 International conference on intelligent data science technologies and applications (IDSTA), IEEE, pp 95–99
Jacovi A, Shalom OS, Goldberg Y (2020) Understanding convolutional neural networks for text classification. arxiv:1809.08037
Kahn J, Lee A, Hannun A (2020) Self-training for end-to-end speech recognition. In: ICASSP 2020–2020 IEEE international conference on acoustics. IEEE, Speech and Signal Processing (ICASSP), pp 7084–7088
Kamal O, Kumar A, Vaidhya T (2021) Hostility detection in hindi leveraging pre-trained language models. In: Combating online hostile posts in regional languages during emergency situation: first international workshop, CONSTRAINT 2021, Collocated with AAAI 2021, Virtual Event, Feb 8, 2021, Revised Selected Papers 1, Springer, pp 213–223
Keum BT, Valdovinos IC, Wong MJ (2023) Problematic internet use, online racism, and mental health issues among racially minoritized emerging adults in the United States. Int J Mental Health Addict, pp 1–17
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. arxiv:1412.6980
Kong X, Liu X, Gu J, et al (2022) Reflash dropout in image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6002–6012
Levin I, Mamlok D (2021) Culture and society in the digital age. Information 12(2):68
Li Z, Liu F, Yang W et al (2022) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Netw Learn Syst 33(12):6999–7019. https://doi.org/10.1109/TNNLS.2021.3084827
Liu C, Zhu W, Zhang X et al (2023) Sentence part-enhanced bert with respect to downstream tasks. Complex Intell Syst 9(1):463–474
Luan Y, Lin S (2019) Research on text classification based on cnn and lstm. In: 2019 IEEE international conference on artificial intelligence and computer applications (ICAICA), IEEE, pp 352–355
MacAvaney S, Yao HR, Yang E et al (2019) Hate speech detection: challenges and solutions. PLoS ONE 14(8):e0221152
Madukwe K, Gao X, Xue B (2020) In data we trust: A critical analysis of hate speech detection datasets. In: Proceedings of the Fourth Workshop on Online Abuse and Harms. Association for Computational Linguistics, Online, pp 150–161, https://doi.org/10.18653/v1/2020.alw-1.18, https://aclanthology.org/2020.alw-1.18
Maslej-Krešňáková V, Sarnovskỳ M, Butka P et al (2020) Comparison of deep learning models and various text pre-processing techniques for the toxic comments classification. Appl Sci 10(23):8631
Mikolov T, Chen K, Corrado G, et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Mossie Z, Wang JH (2020) Vulnerable community identification using hate speech detection on social media. Inf Process Manag 57(3):102087
Ozduzen O, Korkut U, Ozduzen C (2021) Refugees are not welcome: digital racism, online place-making and the evolving categorization of Syrians in Turkey. New Med Soc 23(11):3349–3369
Paramesh S, Shreedhara K (2019) It help desk incident classification using classifier ensembles. ICTACT J Soft Comput 9(04):1980–1987
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Prechelt L (2002) Early stopping-but when? Neural networks: tricks of the trade. Springer, Cham, pp 55–69
Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084
Soni S, Chouhan SS, Rathore SS (2023) Textconvonet: a convolutional neural network based architecture for text classification. Appl Intell 53(11):14249–14268
Thaiprayoon S, Unger H, Kubek M (2020) Graph and centroid-based word clustering. In: Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, pp 163–168
Todorov T, Porter C (2020) Race and racism. Theories of race and racism. Routledge, New York, pp 68–74
Vale KMO, Gorgônio AC, Flavius Da Luz EG et al (2021) An efficient approach to select instances in self-training and co-training semi-supervised methods. IEEE Access 10:7254–7276
Van Engelen JE, Hoos HH (2020) A survey on semi-supervised learning. Mach Learn 109(2):373–440
Vanetik N, Mimoun E (2022) Detection of racist language in French tweets. Information 13(7):318
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
Waseem Z, Hovy D (2016) Hateful symbols or hateful people? Predictive features for hate speech detection on twitter. In: Proceedings of the NAACL student research workshop, pp 88–93
Yang L, Shami A (2020) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415:295–316
Yao G, Lei T, Zhong J (2019) A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett 118:14–22
Yu T, Zhu H (2020) Hyper-parameter optimization: a review of algorithms and applications. arXiv preprint arXiv:2003.05689
Zhu X, Goldberg AB (2022) Introduction to semi-supervised learning. Springer, Cham
Zoph B, Ghiasi G, Lin TY, et al (2020) Rethinking pre-training and self-training. arxiv:2006.06882
Acknowledgements
We acknowledge Bassma Ncir for her contribution to the initial version of the manuscript, including reviewing and editing.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
IE: Conceived and designed the study, conducted experiments, analyzed data, and wrote the manuscript. SH, NSN, JK: Assisted in experimental design, data analysis, and manuscript writing. YN, AH, FZE: Review and Editing.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Not applicable.
Additional information
Responsible editor: Mark Last.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
El Miqdadi, I., Hourri, S., El Idrysy, F.Z. et al. Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN. Data Min Knowl Disc 38, 3805–3830 (2024). https://doi.org/10.1007/s10618-024-01059-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-024-01059-2