
Application-specific word embeddings for hate and offensive language detection

  • Special issue 1183: Multimedia Processing to Tackle the Dark Side of Social Life
  • Published in Multimedia Tools and Applications

Abstract

For the task of hate speech and offensive language detection, this paper explores the potential advantages of using small datasets to develop efficient word embeddings for deep-learning models. We investigate the impact of feature vectors generated by four word-embedding techniques (word2vec, wang2vec, fastText, and GloVe) applied to text datasets on the order of a billion tokens. After training the classifiers with pre-trained word embeddings, we compare their classification performance against that obtained with feature vectors generated from small datasets on the order of thousands of tokens. Through numerical examples, we show that the smallest word embeddings yield slightly lower accuracy but, combined with shorter training times, lead to non-dominated solutions. This has an immediate application: training time can be reduced significantly at a small penalty in classification accuracy. We explore two ways to rank the studied alternatives, based on performance factors and on PROMETHEE-II scores. According to both rankings, GloVe is the best method for NILC embeddings, and fastText is the best method for dataset-specific embeddings. One would expect a dataset-specific word embedding to fit a particular dataset better, yielding shorter training and higher accuracy; however, the obtained results indicate that the NILC embeddings lead to an equally good fit.
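The PROMETHEE-II ranking mentioned in the abstract can be sketched as follows. This is a minimal illustration using the "usual" (step) preference function; the alternatives, criterion values, and equal weights below are hypothetical placeholders, not the paper's actual measurements.

```python
def promethee_ii(scores, weights, maximize):
    """Rank alternatives by PROMETHEE-II net outranking flow.

    scores[i][j] : value of alternative i on criterion j
    weights[j]   : relative importance of criterion j (sums to 1)
    maximize[j]  : True if higher is better on criterion j
    Uses the "usual" preference function: preference is 1 if an
    alternative is strictly better on a criterion, 0 otherwise.
    """
    n, m = len(scores), len(weights)
    # Aggregated preference pi[a][b]: weighted degree to which a outranks b.
    pi = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            for j in range(m):
                diff = scores[a][j] - scores[b][j]
                if not maximize[j]:
                    diff = -diff
                if diff > 0:  # step preference function
                    pi[a][b] += weights[j]
    # Net flow = mean outgoing preference minus mean incoming preference.
    return [(sum(pi[a]) - sum(pi[b][a] for b in range(n))) / (n - 1)
            for a in range(n)]

# Toy example: rank three embeddings on (accuracy, CNN training time).
alts = ["fasttext", "glove", "word2vec"]
scores = [[0.89, 40.0],    # accuracy (maximize), training time in s (minimize)
          [0.90, 120.0],
          [0.86, 45.0]]
phi = promethee_ii(scores, weights=[0.5, 0.5], maximize=[True, False])
ranking = [a for _, a in sorted(zip(phi, alts), reverse=True)]
```

With these toy numbers, the net flows sum to zero (a general property of PROMETHEE-II) and the alternative with the best accuracy/training-time trade-off ranks first.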


Notes

  1. Training time involves CNN training exclusively; to stress that, we often refer to it as CNN training time throughout the manuscript. The word-embedding training is assumed to be performed a priori and only once: we either use pre-trained word embeddings or train our own models. In either case, the training time of the word-embedding model is not taken into account in this study.

  2. The term “word embeddings” refers to the representation of words as vectors of real numbers. These representations are expected to encode some knowledge of the relationships among words, including positional (context) information.

  3. Tokens are atomic units of data used for text analysis. In general, any string delimited by spaces or punctuation marks is considered a token.

  4. http://inf.ufrgs.br/~rppelle/hatedetector/, last accessed on October 23, 2020.

  5. https://rdm.inesctec.pt/id/dataset/cs-2017-008, last accessed on October 23, 2020.

  6. http://nilc.icmc.usp.br/embeddings, last accessed on October 24, 2020.

  7. https://github.com/wlin12/wang2vec, last accessed on October 24, 2020.

  8. https://fasttext.cc/docs/en/support.html, last accessed on October 24, 2020.

  9. https://github.com/stanfordnlp/GloVe, last accessed on October 24, 2020.

  10. https://github.com/samcaetano/hatespeech_detector, last accessed on July 25, 2019.
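The definitions in notes 2 and 3 can be illustrated with a minimal sketch. The regular expression and the toy 3-dimensional vectors are our own illustrative choices; real models such as word2vec, fastText, and GloVe learn much higher-dimensional vectors from large corpora.

```python
import re

# Note 3: a token is any string delimited by spaces or punctuation marks.
# This simple tokenizer keeps alphanumeric runs and drops punctuation.
def tokenize(text):
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Hate speech detection, in Portuguese!")
# tokens == ['hate', 'speech', 'detection', 'in', 'portuguese']

# Note 2: a word embedding maps each token to a vector of real numbers.
# Toy 3-dimensional vectors for illustration only; out-of-vocabulary
# tokens fall back to a zero vector here.
embedding = {
    "hate":   [0.9, 0.1, 0.0],
    "speech": [0.7, 0.3, 0.1],
}
vectors = [embedding.get(t, [0.0, 0.0, 0.0]) for t in tokens]
```

The resulting list of vectors is the kind of feature matrix a CNN classifier would consume, one row per token.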


Acknowledgment

This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq – Brazil), research grants 432997/2018-0, 310841/2019-4, and 440074/2020-7, and by Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ – Brazil), research grants 210.364/2018 and 203.111/2018.

Author information


Corresponding author

Correspondence to Claver P. Soto.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Soto, C.P., Nunes, G.M.S., Gomes, J.G.R.C. et al. Application-specific word embeddings for hate and offensive language detection. Multimed Tools Appl 81, 27111–27136 (2022). https://doi.org/10.1007/s11042-021-11880-2

