
Application-specific word embeddings for hate and offensive language detection

  • Special issue 1183: Multimedia Processing to Tackle the Dark Side of Social Life
  • Published in Multimedia Tools and Applications

Abstract

For the task of hate speech and offensive language detection, this paper explores the potential advantages of using small datasets to develop efficient word embeddings for deep-learning models. We investigate the impact of feature vectors generated by four word-embedding techniques (word2vec, wang2vec, fastText, and GloVe) applied to text datasets on the order of a billion tokens. After training the classifiers with pre-trained word embeddings, we compare their classification performance against that obtained with feature vectors generated from small datasets on the order of thousands of tokens. Through numerical examples, we show that the smallest word embeddings yield slightly lower accuracy but, combined with shorter training times, lead to non-dominated solutions. This has an immediate application: training time can be reduced significantly at a small penalty in classification accuracy. We explore two ways to rank the studied alternatives, based on performance factors and on PROMETHEE-II scores. According to both rankings, GloVe is the best method for NILC embeddings, and fastText is the best method for dataset-specific embeddings. One would expect a dataset-specific word embedding to fit a particular dataset better, yielding shorter training and higher accuracy; however, the obtained results indicate that the NILC embeddings lead to an equally good fit.
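The PROMETHEE-II ranking mentioned in the abstract can be sketched as follows. This is a minimal illustration using the "usual" (step) preference function; the alternatives, criterion values, and equal weights below are hypothetical placeholders, not the paper's actual measurements.

```python
def promethee_ii(scores, weights, maximize):
    """Rank alternatives by PROMETHEE-II net outranking flow.

    scores[i][j] : value of alternative i on criterion j
    weights[j]   : relative importance of criterion j (sums to 1)
    maximize[j]  : True if higher is better on criterion j
    Uses the "usual" preference function: preference is 1 if an
    alternative is strictly better on a criterion, 0 otherwise.
    """
    n, m = len(scores), len(weights)
    # Aggregated preference pi[a][b]: weighted degree to which a outranks b.
    pi = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            for j in range(m):
                diff = scores[a][j] - scores[b][j]
                if not maximize[j]:
                    diff = -diff
                if diff > 0:  # step preference function
                    pi[a][b] += weights[j]
    # Net flow = mean outgoing preference minus mean incoming preference.
    return [(sum(pi[a]) - sum(pi[b][a] for b in range(n))) / (n - 1)
            for a in range(n)]

# Toy example: rank three embeddings on (accuracy, CNN training time).
alts = ["fasttext", "glove", "word2vec"]
scores = [[0.89, 40.0],    # accuracy (maximize), training time in s (minimize)
          [0.90, 120.0],
          [0.86, 45.0]]
phi = promethee_ii(scores, weights=[0.5, 0.5], maximize=[True, False])
ranking = [a for _, a in sorted(zip(phi, alts), reverse=True)]
```

With these toy numbers, the net flows sum to zero (a general property of PROMETHEE-II) and the alternative with the best accuracy/training-time trade-off ranks first.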


Notes

  1. Training time involves CNN training exclusively; to stress that, we often refer to it as CNN training time throughout the manuscript. The word-embedding training is assumed to be performed a priori and only once: we either use pre-trained word embeddings or train our own models. In either case, the training time of the word-embedding model is not taken into account in this study.

  2. The term “word embeddings” refers to the representation of words as vectors of real numbers. These representations are expected to encode some knowledge of the relationships among words, including positional (context) information.

  3. Tokens are atomic units of data used for text analysis. In general, any string delimited by spaces or punctuation marks is considered a token.

  4. http://inf.ufrgs.br/~rppelle/hatedetector/, last accessed on October 23, 2020.

  5. https://rdm.inesctec.pt/id/dataset/cs-2017-008, last accessed on October 23, 2020.

  6. http://nilc.icmc.usp.br/embeddings, last accessed on October 24, 2020.

  7. https://github.com/wlin12/wang2vec, last accessed on October 24, 2020.

  8. https://fasttext.cc/docs/en/support.html, last accessed on October 24, 2020.

  9. https://github.com/stanfordnlp/GloVe, last accessed on October 24, 2020.

  10. https://github.com/samcaetano/hatespeech_detector, last accessed on July 25, 2019.
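The definitions in notes 2 and 3 can be illustrated with a minimal sketch. The regular expression and the toy 3-dimensional vectors are our own illustrative choices; real models such as word2vec, fastText, and GloVe learn much higher-dimensional vectors from large corpora.

```python
import re

# Note 3: a token is any string delimited by spaces or punctuation marks.
# This simple tokenizer keeps alphanumeric runs and drops punctuation.
def tokenize(text):
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Hate speech detection, in Portuguese!")
# tokens == ['hate', 'speech', 'detection', 'in', 'portuguese']

# Note 2: a word embedding maps each token to a vector of real numbers.
# Toy 3-dimensional vectors for illustration only; out-of-vocabulary
# tokens fall back to a zero vector here.
embedding = {
    "hate":   [0.9, 0.1, 0.0],
    "speech": [0.7, 0.3, 0.1],
}
vectors = [embedding.get(t, [0.0, 0.0, 0.0]) for t in tokens]
```

The resulting list of vectors is the kind of feature matrix a CNN classifier would consume, one row per token.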


Acknowledgment

This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq – Brazil), research grants 432997/2018-0, 310841/2019-4, and 440074/2020-7, and by Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ – Brazil), research grants 210.364/2018 and 203.111/2018.

Author information


Corresponding author

Correspondence to Claver P. Soto.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Soto, C.P., Nunes, G.M.S., Gomes, J.G.R.C. et al. Application-specific word embeddings for hate and offensive language detection. Multimed Tools Appl 81, 27111–27136 (2022). https://doi.org/10.1007/s11042-021-11880-2

