COVID-19 Fake News Detection Using Joint Doc2Vec and Text Features with PCA

  • Conference paper
Advanced Research in Technologies, Information, Innovation and Sustainability (ARTIIS 2022)

Abstract

With the current pandemic, it is imperative to stay up to date with the news, and many sources contribute to this purpose. However, misinformation and fake news also spread within society. In this work, a machine learning approach to detect fake news related to COVID-19 is developed. Specifically, the Doc2Vec language model is used to transform text documents into vector representations, and handcrafted features such as document length, the proportion of personal pronouns, and punctuation are included as complementary features. Then, Principal Component Analysis (PCA) is performed on the original feature vectors to reduce dimensionality. Both the original and the reduced data are fed to various machine learning models, which are compared in terms of accuracy, precision, recall, and execution time. The results indicate that the reduced set of features had minimal impact on accuracy. However, execution times are greatly reduced in most cases, particularly at testing time, indicating that dimensionality reduction can be useful for projects already in production that need model inference on large volumes of documents to detect fake news.
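To make the handcrafted features concrete, the following is a minimal sketch of how they could be computed and joined with a document vector. It is not the authors' implementation: the pronoun list, the tokenization rule, the choice to measure length in tokens, and the function names `handcrafted_features` and `join_features` are all assumptions made for illustration. The document vector is passed in as a plain list, standing in for whatever a Doc2Vec model would produce.

```python
import re
import string

# Assumed pronoun list; the abstract does not say which pronouns are counted.
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him",
    "she", "her", "it", "they", "them",
}


def handcrafted_features(text):
    """Return [token count, personal-pronoun proportion, punctuation proportion].

    Length is measured in word tokens and punctuation as a share of all
    characters; both are illustrative choices, not the paper's definitions.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = len(tokens)
    n_pronouns = sum(1 for t in tokens if t in PERSONAL_PRONOUNS)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    pronoun_ratio = n_pronouns / n_tokens if n_tokens else 0.0
    punct_ratio = n_punct / len(text) if text else 0.0
    return [float(n_tokens), pronoun_ratio, punct_ratio]


def join_features(doc_vector, text):
    """Concatenate a document embedding with the handcrafted features."""
    return list(doc_vector) + handcrafted_features(text)


if __name__ == "__main__":
    # Hypothetical 2-dimensional embedding, used only to show the joined shape.
    print(join_features([0.1, 0.2], "We think it works!"))
```

In a full pipeline, the joined vectors would then be reduced with PCA before being passed to a classifier; that step is omitted here to keep the sketch free of third-party dependencies.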

Author information

Corresponding author

Correspondence to Carlos Chipantiza.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mejia, H., Chipantiza, C., Llumiquinga, J., Amaro, I.R., Fonseca-Delgado, R. (2022). COVID-19 Fake News Detection Using Joint Doc2Vec and Text Features with PCA. In: Guarda, T., Portela, F., Augusto, M.F. (eds) Advanced Research in Technologies, Information, Innovation and Sustainability. ARTIIS 2022. Communications in Computer and Information Science, vol 1675. Springer, Cham. https://doi.org/10.1007/978-3-031-20319-0_24

  • DOI: https://doi.org/10.1007/978-3-031-20319-0_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20318-3

  • Online ISBN: 978-3-031-20319-0

  • eBook Packages: Computer Science, Computer Science (R0)
