Abstract
During the current pandemic, staying up to date with the news is imperative, and many sources serve this purpose. However, misinformation and fake news also spread within society. In this work, a machine learning approach to detect fake news related to COVID-19 is developed. Specifically, the Doc2Vec language model is used to transform text documents into vector representations, and handcrafted features such as document length, the proportion of personal pronouns, and punctuation are included as complementary features. Then, Principal Component Analysis (PCA) is performed on the original feature vectors to reduce dimensionality. Both the original and the reduced data are fed to various machine learning models, and the results are compared in terms of accuracy, precision, recall, and execution time. The results indicate that the reduced feature set has minimal impact on accuracy. However, execution times are greatly reduced in most cases, particularly at testing time, indicating that dimensionality reduction can be useful for projects already in production that need model inference on large volumes of documents to detect fake news.
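The handcrafted features named in the abstract (document length, proportion of personal pronouns, punctuation) could be computed roughly as in the sketch below. It is a minimal illustration and not the authors' implementation. The pronoun list, the whitespace tokenization, the per-character definition of the punctuation proportion, and the helper names `handcrafted_features` and `join_features` are all assumptions made here for illustration.

```python
import string
from typing import List, Sequence

# Assumption: a small illustrative pronoun list; the chapter's actual list is not given here.
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him",
    "she", "her", "it", "they", "them",
}


def handcrafted_features(text: str) -> List[float]:
    """Return [token count, pronoun proportion, punctuation proportion].

    Assumptions: tokens are whitespace-separated, the pronoun proportion is
    taken over tokens, and the punctuation proportion is taken over characters.
    """
    tokens = text.split()
    n_tokens = len(tokens)
    # Strip surrounding punctuation so that "you," still counts as a pronoun.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    n_pronouns = sum(1 for w in words if w in PERSONAL_PRONOUNS)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    pronoun_ratio = n_pronouns / n_tokens if n_tokens else 0.0
    punct_ratio = n_punct / len(text) if text else 0.0
    return [float(n_tokens), pronoun_ratio, punct_ratio]


def join_features(doc_vector: Sequence[float], text: str) -> List[float]:
    """Append the handcrafted features to a document embedding.

    `doc_vector` stands in for a Doc2Vec vector computed elsewhere.
    """
    return list(doc_vector) + handcrafted_features(text)
```

In a pipeline like the one described, the joined vectors would then be passed to PCA and to the classifiers. Because the three handcrafted features live on different scales than the embedding dimensions, standardizing the columns before PCA is a common choice, although the abstract does not say whether this was done.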
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mejia, H., Chipantiza, C., Llumiquinga, J., Amaro, I.R., Fonseca-Delgado, R. (2022). COVID-19 Fake News Detection Using Joint Doc2Vec and Text Features with PCA. In: Guarda, T., Portela, F., Augusto, M.F. (eds) Advanced Research in Technologies, Information, Innovation and Sustainability. ARTIIS 2022. Communications in Computer and Information Science, vol 1675. Springer, Cham. https://doi.org/10.1007/978-3-031-20319-0_24
DOI: https://doi.org/10.1007/978-3-031-20319-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20318-3
Online ISBN: 978-3-031-20319-0
eBook Packages: Computer Science (R0)