On the Importance of Word Embedding in Automated Harmful Information Detection

Mohtaj, Salar; Möller, Sebastian

doi:10.1007/978-3-031-16270-1_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13502)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Social media have grown rapidly in recent years. They have changed many aspects of human life, especially how people communicate and how they access information. However, along with these important benefits, social media have posed a number of significant challenges since they were introduced. The spread of fake news and hate speech is among the most challenging of these issues and has attracted a lot of attention from researchers in recent years. Various models based on natural language processing have been developed to combat these phenomena and stop them in the early stages, before mass spreading. Given the difficulty of automated harmful information detection (i.e., fake news and hate speech detection), every step of the detection process could have a noticeable impact on the performance of models. In this paper, we study the importance of word embedding for the overall performance of deep neural network architectures in detecting fake news and hate speech on social media. We test various approaches for converting raw input text into vectors, from random weighting to state-of-the-art contextual word embedding models. In addition to comparing different word embedding approaches, we analyze different strategies for obtaining the vectors from contextual word embedding models (i.e., taking the weights from the last layer versus averaging the weights of the last layers). Our results show that XLNet embedding outperforms the other embedding approaches on both tasks related to harmful information identification.
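
The two strategies for obtaining vectors from a contextual model can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-layer hidden states are already available as nested lists ordered from the input layer to the output layer, and the names `last_layer`, `mean_of_last_layers`, and the parameter `k` (how many final layers to average) are hypothetical. The abstract does not state how many layers were averaged, so `k` is left to the caller.

```python
from typing import List

Vector = List[float]
Layer = List[Vector]          # one vector per token
HiddenStates = List[Layer]    # one entry per layer, ordered input -> output


def last_layer(hidden_states: HiddenStates) -> Layer:
    """Strategy 1: use the token vectors of the final layer as they are."""
    return [list(vec) for vec in hidden_states[-1]]


def mean_of_last_layers(hidden_states: HiddenStates, k: int) -> Layer:
    """Strategy 2: average each token vector element-wise over the last k layers.

    k is a hypothetical parameter; the source does not specify its value.
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    selected = hidden_states[-k:]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    return [
        [sum(layer[t][d] for layer in selected) / k for d in range(dim)]
        for t in range(num_tokens)
    ]


# Toy input with made-up values: 3 layers, 2 tokens, 2 dimensions.
toy = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
last = last_layer(toy)
avg = mean_of_last_layers(toy, k=2)
```

With `k=1` the averaging strategy reduces to the last-layer strategy, which makes the two directly comparable as points on the same axis.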

Acknowledgment

This research was funded in part by the German Federal Ministry of Education and Research (BMBF) under grant number 01IS17043 (project ILSFAS).

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Salar Mohtaj & Sebastian Möller
German Research Centre for Artificial Intelligence (DFKI), Labor Berlin, Germany
Salar Mohtaj & Sebastian Möller

Corresponding author

Correspondence to Salar Mohtaj.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Mohtaj, S., Möller, S. (2022). On the Importance of Word Embedding in Automated Harmful Information Detection. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_21

DOI: https://doi.org/10.1007/978-3-031-16270-1_21
Published: 16 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16269-5
Online ISBN: 978-3-031-16270-1
eBook Packages: Computer Science, Computer Science (R0)
