Abstract
The COVID-19 pandemic has highlighted the potential of social media to spread mis-, dis-, and malinformation. This paper investigates embedding- and cluster-based topic modelling to characterise the COVID-19 infodemic on South African Twitter, which has remained largely unstudied during the pandemic. The best-performing model identifies specific misinformation narratives, but these narratives are mostly found within more general topics. A more fine-grained model is then trained, and it isolates rumour and misinformation topics from more general topics much more effectively. Finally, the paper makes several suggestions for dealing with the multilingual and code-switched nature of South African Twitter, and for exploring and developing new dynamic topic modelling approaches that could be especially valuable for tracing how specific misinformation or rumour narratives develop over time. The paper presents novel insights and results on applying a combination of data mining, machine learning and optimisation to the pressing issue of misleading information on social media.
Acknowledgements
This work is based on the research supported in part by the National Research Foundation of South Africa (Grant number: 129340).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Strydom, I.F., Grobler, J. (2023). Topic Modelling for Characterizing COVID-19 Misinformation on Twitter: A South African Case Study. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023. ICCSA 2023. Lecture Notes in Computer Science, vol 13957. Springer, Cham. https://doi.org/10.1007/978-3-031-36808-0_19
DOI: https://doi.org/10.1007/978-3-031-36808-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36807-3
Online ISBN: 978-3-031-36808-0
eBook Packages: Computer Science, Computer Science (R0)