Abstract
The proliferation of social media has created new norms in society. Abuse, hate, harassment and misogyny are widespread across social media platforms. Advances in machine learning have produced sophisticated text mining methods for analysing text data. Social media data poses additional challenges to these methods because posts are short and often contain ambiguity, errors and noise. Over the past decade, machine learning researchers have focused on finding solutions to these challenges. The resulting methods strengthen social media monitoring capability and can help policymakers and governments focus on key issues. This chapter reviews various types of machine learning techniques, including the currently popular deep learning methods, that can be used to analyse social media data and identify abusive content.
Acknowledgements
I would like to acknowledge my research team, especially Dr Md Abul Bashar, who has been conducting research on this topic for a few years.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Nayak, R., Baek, H.S. (2022). Machine Learning for Identifying Abusive Content in Text Data. In: Virvou, M., Tsihrintzis, G.A., Jain, L.C. (eds) Advances in Selected Artificial Intelligence Areas. Learning and Analytics in Intelligent Systems, vol 24. Springer, Cham. https://doi.org/10.1007/978-3-030-93052-3_9
DOI: https://doi.org/10.1007/978-3-030-93052-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93051-6
Online ISBN: 978-3-030-93052-3
eBook Packages: Intelligent Technologies and Robotics (R0)