The Impact of Pre-processing on the Performance of Automated Fake News Detection

Mohtaj, Salar; Möller, Sebastian

doi:10.1007/978-3-031-13643-6_7

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13390))

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Fake news spreading through social media has become a serious problem in recent years, especially since the 2016 United States presidential election. Accordingly, researchers have paid increasing attention to developing automated tools, based on natural language processing methods, to combat content that contains misinformation. Although the performance of fake news detection models has improved through more complex architectures and state-of-the-art models, less attention has been paid to the impact of pre-processing on the overall performance of such models. In this study, we investigate the impact of pre-processing, especially the removal of URLs, on the performance of fake news detection systems. We compared the performance of fake news detection in tweets, framed as a text classification task, using a support vector machine, long short-term memory networks, and the pre-trained BERT model. In addition to URLs, we analyzed the impact of different approaches for dealing with emojis and Twitter handles on the performance of the models. Our results show that URLs could be good clues for identifying fake news, despite the fact that they are usually removed in the pre-processing step.
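
The pre-processing variants compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `preprocess`, the three modes (`keep`, `remove`, `token`), the placeholder tokens, and the regular expressions (including the rough emoji code point ranges) are all assumptions made for illustration.

```python
import re

# Assumed patterns for illustration only; real tweets need more robust matching.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"(?<!\w)@\w{1,15}")
# Rough emoji ranges (pictographs, misc symbols, dingbats, variation selector).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]+")


def preprocess(text, url="remove", handle="remove", emoji="remove"):
    """Apply one pre-processing variant to a tweet.

    Each option is 'keep' (leave as is), 'remove' (delete the match),
    or 'token' (replace the match with a hypothetical placeholder token).
    """
    rules = (
        (URL_RE, url, "<url>"),
        (HANDLE_RE, handle, "<user>"),
        (EMOJI_RE, emoji, "<emoji>"),
    )
    for pattern, mode, token in rules:
        if mode == "remove":
            text = pattern.sub(" ", text)
        elif mode == "token":
            text = pattern.sub(f" {token} ", text)
        elif mode != "keep":
            raise ValueError(f"unknown mode: {mode!r}")
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


tweet = "@who says masks work 😷 https://example.org/x"
print(preprocess(tweet))              # everything stripped
print(preprocess(tweet, url="keep"))  # URL kept as a potential clue
```

Running each classifier once per variant, with everything else held fixed, is one way to isolate the effect of a single pre-processing decision such as URL removal.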

Notes

1. https://competitions.codalab.org/competitions/26655
2. https://pypi.org/project/demoji/
3. https://pypi.org/project/tweepy/
4. https://pypi.org/project/beautifulsoup4/
5. https://huggingface.co/
6. More details about the implementation and parameters can be found in the GitHub repository of the project at https://github.com/salarmohtaj/FakeNews_Detection_Twitter.

Acknowledgment

This research was funded in part by the German Federal Ministry of Education and Research (BMBF) under grant number 01IS17043 (project ILSFAS).

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Salar Mohtaj & Sebastian Möller
German Research Centre for Artificial Intelligence (DFKI), Labor Berlin, Berlin, Germany
Salar Mohtaj & Sebastian Möller

Corresponding author

Correspondence to Salar Mohtaj.

Editor information

Editors and Affiliations

University of Bologna, Forlì, Italy
Alberto Barrón-Cedeño
University of Padua, Padova, Italy
Giovanni Da San Martino
University of Bologna, Bologna, Italy
Mirko Degli Esposti
Istituto di Scienza e Tecnologie dell'Informazione “Alessandro Faedo”, Pisa, Italy
Fabrizio Sebastiani
University of Glasgow, Glasgow, UK
Craig Macdonald
University Milano-Bicocca, Milan, Italy
Gabriella Pasi
TU Wien, Vienna, Austria
Allan Hanbury
Leipzig University, Leipzig, Germany
Martin Potthast
University of Padua, Padova, Italy
Guglielmo Faggioli
University of Padua, Padova, Italy
Nicola Ferro

About this paper

Cite this paper

Mohtaj, S., Möller, S. (2022). The Impact of Pre-processing on the Performance of Automated Fake News Detection. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_7

DOI: https://doi.org/10.1007/978-3-031-13643-6_7
Published: 25 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13642-9
Online ISBN: 978-3-031-13643-6
eBook Packages: Computer Science, Computer Science (R0)