Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN

El Miqdadi, Ikram; Hourri, Soufiane; El Idrysy, Fatima Zahra; Hayati, Assia; Namir, Yassine; Nikolov, Nikola S.; Kharroubi, Jamal

doi:10.1007/s10618-024-01059-2

Published: 11 July 2024

Data Mining and Knowledge Discovery, Volume 38, pages 3805–3830 (2024)

Ikram El Miqdadi¹ (ORCID: orcid.org/0009-0006-0162-5784),
Soufiane Hourri¹,²,
Fatima Zahra El Idrysy¹,
Assia Hayati¹,
Yassine Namir¹,
Nikola S. Nikolov³ &
Jamal Kharroubi¹

Abstract

Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating diverse, representative data, a task that is costly in both time and resources. Moreover, racism manifests differently across languages because each language carries its own cultural subtleties and vocabulary. Effective detection therefore requires resources in the native languages, which further complicates building a dataset designed for identifying racism on social media platforms. This study presents an automated data annotation system for racism classification that combines self-training with the Sentence-BERT (SBERT) transformer-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in creating a multilingual racism dataset of 26,866 instances gathered from Facebook and Twitter: a self-training process uses a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance and finds significant improvements in model effectiveness. For the English dataset, the system achieves an accuracy of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%, and the Arabic dataset an accuracy of 91.03% and an F-score of 92.15%. Overall, self-training yields an 8–12% improvement in accuracy and F-score.
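
The self-training process described in the abstract, in which a labeled subset is used to annotate the remaining unlabeled data, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SBERT embeddings are replaced by hand-made 2-D vectors, the CNN by a nearest-centroid classifier, and the confidence threshold of 0.8, the stopping rule, and all function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit(X, y):
    """Toy stand-in for the CNN: one centroid per class label."""
    return {c: centroid([x for x, lab in zip(X, y) if lab == c])
            for c in sorted(set(y))}

def predict_proba(model, x):
    """Softmax over negative Euclidean distances to each class centroid."""
    scores = {c: -math.dist(x, mu) for c, mu in model.items()}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled items and retrain.

    threshold and max_rounds are assumed values, not taken from the paper.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = fit(X_lab, y_lab)          # train on the current labeled set
        remaining, added = [], 0
        for x in pool:
            proba = predict_proba(model, x)
            label, conf = max(proba.items(), key=lambda kv: kv[1])
            if conf >= threshold:          # confident: accept the pseudo-label
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:                          # uncertain: leave it unlabeled
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:         # nothing new learned, or all labeled
            break
    return fit(X_lab, y_lab), X_lab, y_lab, pool

# Toy "embeddings": class 0 near the origin, class 1 near (5, 5).
X_lab = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
y_lab = [0, 0, 1, 1]
X_unlab = [(0.5, 0.5), (5.5, 5.5), (2.5, 3.0)]

model, X_all, y_all, unresolved = self_train(X_lab, y_lab, X_unlab)
print(y_all)       # labels after self-training, pseudo-labels appended last
print(unresolved)  # items the classifier never labeled with enough confidence
```

In this toy run the two points close to a class centroid receive pseudo-labels, while the point lying between the classes stays unlabeled. A real pipeline would leave such low-confidence items for manual annotation or a later round.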

Availability of data and materials

The datasets and materials generated during the current study are available from the corresponding author on reasonable request.

Code availability

The custom code developed for this research is available from the corresponding author on reasonable request.

Acknowledgements

We acknowledge Bassma Ncir for her contribution to the initial version of the manuscript, including reviewing and editing.

Funding

Not applicable.

Author information

Authors and Affiliations

Laboratory of Intelligent Systems and Applications, University of Sidi Mohamed Ben Abdellah, Route d’Imouzzer, 2202, Fez, Fez-Meknes, Morocco
Ikram El Miqdadi, Soufiane Hourri, Fatima Zahra El Idrysy, Assia Hayati, Yassine Namir & Jamal Kharroubi
Laboratory of Process, Industrial Signals and Computer Science, University of Cadi Ayyad, Route Dar Si Aissa, 46000, Safi, Marrakech-Safi, Morocco
Soufiane Hourri
Department of Computer Science and Information Systems, University of Limerick, Limerick, V94 T9PX, Ireland
Nikola S. Nikolov

Contributions

IE: Conceived and designed the study, conducted experiments, analyzed data, and wrote the manuscript. SH, NSN, JK: Assisted in experimental design, data analysis, and manuscript writing. YN, AH, FZE: Review and Editing.

Corresponding author

Correspondence to Ikram El Miqdadi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Not applicable.

Additional information

Responsible editor: Mark Last.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

El Miqdadi, I., Hourri, S., El Idrysy, F.Z. et al. Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN. Data Min Knowl Disc 38, 3805–3830 (2024). https://doi.org/10.1007/s10618-024-01059-2

Received: 19 August 2023
Accepted: 30 June 2024
Published: 11 July 2024
Issue Date: November 2024
DOI: https://doi.org/10.1007/s10618-024-01059-2
