Abstract
Fake news spreading through social media has become a serious problem in recent years, especially since the 2016 United States presidential election. Accordingly, researchers have paid increasing attention to developing automated tools that use natural language processing methods to combat misinformation. Although the performance of fake news detection models has improved through more complex architectures and state-of-the-art models, less attention has been paid to the impact of pre-processing on the overall performance of such models. In this study, we investigate the impact of pre-processing, especially the removal of URLs, on the performance of fake news detection systems. We treat fake news detection in tweets as a text classification task and compare the performance of a support vector machine, a long short-term memory network, and a pre-trained BERT model. In addition to URLs, we analyze the impact of different approaches for dealing with emojis and Twitter handles on the performance of the models. Our results show that URLs could be good clues for identifying fake news, even though they are usually removed in the pre-processing step.
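To make the pre-processing choices concrete, the following is a minimal sketch of how URLs, Twitter handles, and emojis in a tweet could each be kept, removed, or replaced by a placeholder before classification. It is not the implementation used in this study: the regular expressions, the three modes (`keep`, `remove`, `mask`), the placeholder tokens, and the function name `preprocess` are all assumptions made for illustration.

```python
import re

# Illustrative patterns only (assumptions, not the paper's rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# "@" followed by word characters, not preceded by a word character,
# so that e-mail addresses such as a@b.com are left alone.
HANDLE_RE = re.compile(r"(?<!\w)@\w+")
# Rough, non-exhaustive range of common emoji and symbol code points.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def preprocess(text, url="remove", handle="remove", emoji="remove"):
    """Apply one of 'keep', 'remove', or 'mask' to each tweet element."""
    rules = (
        (URL_RE, url, "<URL>"),        # hypothetical placeholder tokens
        (HANDLE_RE, handle, "<USER>"),
        (EMOJI_RE, emoji, "<EMOJI>"),
    )
    for pattern, mode, token in rules:
        if mode == "remove":
            text = pattern.sub(" ", text)
        elif mode == "mask":
            text = pattern.sub(f" {token} ", text)
        elif mode != "keep":
            raise ValueError(f"unknown mode: {mode!r}")
    # Collapse the whitespace left behind by substitutions.
    return " ".join(text.split())


tweet = "Breaking: cure found! https://example.com/x @who \U0001F637"
print(preprocess(tweet))                        # everything stripped
print(preprocess(tweet, url="keep"))            # URL retained as a feature
print(preprocess(tweet, url="mask", handle="mask", emoji="mask"))
```

Running a classifier on the output of each configuration, with everything else held fixed, is one way to isolate the effect of a single pre-processing decision such as keeping URLs.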
Notes
More details about the implementation and parameters can be found in the GitHub repository of the project at https://github.com/salarmohtaj/FakeNews_Detection_Twitter.
Acknowledgment
This research was funded in part by the German Federal Ministry of Education and Research (BMBF) under grant number 01IS17043 (project ILSFAS).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mohtaj, S., Möller, S. (2022). The Impact of Pre-processing on the Performance of Automated Fake News Detection. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_7
DOI: https://doi.org/10.1007/978-3-031-13643-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13642-9
Online ISBN: 978-3-031-13643-6
eBook Packages: Computer Science, Computer Science (R0)