Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN

Data Mining and Knowledge Discovery

Abstract

Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection depends on collecting and annotating diverse, representative data, a task that is costly in both time and resources. Moreover, racism manifests differently across languages because each language carries its own cultural subtleties and vocabulary. Effective detection therefore requires resources in the native language, which further complicates building a dataset for identifying racism on social media platforms. This study presents an automated data annotation system for racism classification that combines self-training with the Sentence-BERT (SBERT) transformer-based model for data representation and a Convolutional Neural Network (CNN) model. The system supports the creation of a multilingual racism dataset of 26,866 instances gathered from Facebook and Twitter: a self-training process uses a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance and finds substantial gains in model effectiveness. On the English dataset, the system achieves an accuracy of 92.53% and an F-score of 88.26%. On the French dataset, it reaches an accuracy of 93.64% and an F-score of 92.68%. On the Arabic dataset, it reaches an accuracy of 91.03% and an F-score of 92.15%. Overall, self-training improves accuracy and F-score by 8–12%.
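
The self-training idea described in the abstract (use a labeled subset to annotate the remaining unlabeled data) can be illustrated with a minimal sketch. This is not the authors' implementation. As stated assumptions, a nearest-centroid classifier stands in for the CNN, small hand-made 2-D vectors stand in for SBERT sentence embeddings, and the names self_train, threshold and max_rounds, as well as the confidence threshold of 0.8, are hypothetical choices made only for illustration.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_proba(x, centroids):
    """Softmax over negative distances to each class centroid."""
    scores = {label: math.exp(-distance(x, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Iteratively pseudo-label confident unlabeled points.

    labeled:   list of (vector, label) pairs (the manually annotated subset)
    unlabeled: list of vectors (the data still to be annotated)
    Returns the grown labeled set and the vectors left unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # 1. Fit the stand-in classifier on everything labeled so far.
        by_label = {}
        for x, y in labeled:
            by_label.setdefault(y, []).append(x)
        centroids = {y: centroid(xs) for y, xs in by_label.items()}

        # 2. Predict on the pool; keep only confident predictions.
        confident, remaining = [], []
        for x in pool:
            proba = predict_proba(x, centroids)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)

        # 3. Stop when nothing new can be annotated confidently.
        if not confident:
            break
        labeled.extend(confident)
        pool = remaining
        if not pool:
            break
    return labeled, pool

# Toy data: label 0 and label 1 are two arbitrary classes.
seed = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
to_annotate = [(0.2, 0.1), (3.9, 4.2), (2.0, 2.0)]
annotated, left_over = self_train(seed, to_annotate)
```

In this toy run the two points close to a seed example receive pseudo-labels, while the ambiguous midpoint stays unlabeled because its confidence never reaches the threshold. A real pipeline would replace the toy vectors with sentence embeddings and the centroid classifier with a trained neural model, and the threshold would need to be tuned on held-out labeled data.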

Availability of data and materials

The datasets and materials generated during the current study are available from the corresponding author on reasonable request.

Code availability

The custom code developed for this research is available from the corresponding author on reasonable request.

Notes

  1. https://www.demandsage.com/social-media-users/.

  2. https://www.berlitz.com/blog/most-spoken-languages-world.

  3. https://github.com/ikram280/racism/blob/49c56171c93d9e9d96c0a3abb64b0dc5c4ca543d/racism%20keywords/racist%20keywords.xlsx.

  4. https://hatebase.org/.

  5. Chink: ethnic insult usually referring to a person of Chinese descent.

  6. https://www.tweepy.org/.

  7. https://pypi.org/project/facebook-scraper/.

  8. https://www.facebook.com/help/203805466323736?cms_id=203805466323736.

  9. https://developers.facebook.com/docs/development/release/data-handling-questions/questions-preview.

  10. https://developers.facebook.com/docs/graph-api/.

  11. https://www.europarl.europa.eu/RegData/etudes/BRIE/2023/745691/EPRS_BRI(2023)745691_EN.pdf.

  12. https://github.com/ikram280/racism/tree/7a1854435f1cfbc3ad3829719d11142038f5536d/racism_dataset.

  13. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html.

  14. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html.

  15. https://github.com/ikram280/racism/tree/7a1854435f1cfbc3ad3829719d11142038f5536d/abbreviations.

  16. https://docs.python.org/3/library/re.html.

  17. https://spacy.io/.

  18. https://tedboy.github.io/nlps/generated/generated/nltk.stem.ISRIStemmer.html.

  19. https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1.

  20. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html.

Acknowledgements

We acknowledge Bassma Ncir for her contribution to the initial version of the manuscript, including reviewing and editing.

Funding

Not applicable.

Author information

Contributions

IE: Conceived and designed the study, conducted experiments, analyzed data, and wrote the manuscript. SH, NSN, JK: Assisted in experimental design, data analysis, and manuscript writing. YN, AH, FZE: Review and Editing.

Corresponding author

Correspondence to Ikram El Miqdadi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Not applicable.

Additional information

Responsible editor: Mark Last.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

El Miqdadi, I., Hourri, S., El Idrysy, F.Z. et al. Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN. Data Min Knowl Disc 38, 3805–3830 (2024). https://doi.org/10.1007/s10618-024-01059-2

  • DOI: https://doi.org/10.1007/s10618-024-01059-2
