COVID-19 Fake News Detection Using Joint Doc2Vec and Text Features with PCA

Mejia, Hector; Chipantiza, Carlos; Llumiquinga, Jose; Amaro, Isidro R.; Fonseca-Delgado, Rigoberto

doi:10.1007/978-3-031-20319-0_24

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1675)

Included in the following conference series:

International Conference on Advanced Research in Technologies, Information, Innovation and Sustainability

Abstract

With the current pandemic, it is imperative to stay up to date with the news, and many sources contribute to this purpose. However, misinformation and fake news also spread within society. In this work, a machine learning approach to detect fake news related to COVID-19 is developed. Specifically, the Doc2Vec language model is used to transform text documents into vector representations, and handcrafted features such as document length, the proportion of personal pronouns, and punctuation are included as complementary features. Then, Principal Component Analysis (PCA) is performed on the original feature vectors to reduce dimensionality. Both the original and the reduced data are fed to various machine learning models, which are then compared in terms of accuracy, precision, recall, and execution time. The results indicate that the reduced feature set had minimal impact on accuracy. However, execution times are greatly reduced in most cases, particularly at testing time, indicating that dimensionality reduction can be useful for projects already in production that need model inference on large volumes of documents to detect fake news.
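As an illustration of the handcrafted features named in the abstract, the following is a minimal sketch, not the authors' implementation. The pronoun list, the tokenization rule, the choice to express punctuation as a proportion of characters, and the function names `handcrafted_features` and `joint_vector` are all assumptions made for this example. The document embedding is treated as a precomputed list of floats, standing in for a Doc2Vec vector.

```python
import re
import string

# Assumed pronoun list; the abstract does not say which pronouns were counted.
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him", "she", "her", "it", "they", "them",
}


def handcrafted_features(text):
    """Return [token count, personal-pronoun proportion, punctuation proportion]."""
    # Assumed tokenization: runs of letters and apostrophes, lowercased.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = len(tokens)
    n_pronouns = sum(1 for tok in tokens if tok in PERSONAL_PRONOUNS)
    # Assumed definition: punctuation characters as a share of all characters.
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    pronoun_ratio = n_pronouns / n_tokens if n_tokens else 0.0
    punct_ratio = n_punct / len(text) if text else 0.0
    return [float(n_tokens), pronoun_ratio, punct_ratio]


def joint_vector(doc_embedding, text):
    """Concatenate a precomputed document embedding with the handcrafted features."""
    return list(doc_embedding) + handcrafted_features(text)
```

Under these assumptions, a joint vector for one document would be built as `joint_vector(embedding, text)`, and the stacked joint vectors would be the input to PCA and then to the classifiers being compared.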

Author information

Authors and Affiliations

Yachay Tech, School of Mathematical and Computational Sciences, San Miguel de Urcuqui, 100119, Imbabura, Ecuador
Hector Mejia, Carlos Chipantiza, Jose Llumiquinga, Isidro R. Amaro & Rigoberto Fonseca-Delgado
Factored AI, Palo Alto, CA, USA
Hector Mejia & Carlos Chipantiza

Corresponding author

Correspondence to Carlos Chipantiza.

Editor information

Editors and Affiliations

Universidad Estatal Península de Santa, La Libertad, Ecuador
Teresa Guarda
University of Minho, Guimarães, Portugal
Filipe Portela
BITrum Research Group, Leon, Spain
Maria Fernanda Augusto

About this paper

Cite this paper

Mejia, H., Chipantiza, C., Llumiquinga, J., Amaro, I.R., Fonseca-Delgado, R. (2022). COVID-19 Fake News Detection Using Joint Doc2Vec and Text Features with PCA. In: Guarda, T., Portela, F., Augusto, M.F. (eds) Advanced Research in Technologies, Information, Innovation and Sustainability. ARTIIS 2022. Communications in Computer and Information Science, vol 1675. Springer, Cham. https://doi.org/10.1007/978-3-031-20319-0_24

DOI: https://doi.org/10.1007/978-3-031-20319-0_24
Published: 25 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20318-3
Online ISBN: 978-3-031-20319-0
eBook Packages: Computer Science, Computer Science (R0)
