DOI: 10.1145/3477495.3531745
research-article

CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines

Published: 07 July 2022

Abstract

Convincing people to get vaccinated against COVID-19 is a key societal challenge at present. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have about these vaccines, such as potential side-effects, ineffectiveness, and political factors. Though there are datasets that broadly classify social media posts into Anti-Vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled with specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. The dataset additionally provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that it is very challenging for multi-label explainable classification and tweet summarization, as is evident from the moderate scores achieved by some state-of-the-art models.

Supplementary Material

MP4 File (SIGIR22-rs1891.mp4)
A presentation describing the CAVES dataset, which consists of tweets for explainable multi-label classification and corresponding summaries.

        Published In

        SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
        July 2022
        3569 pages
        ISBN:9781450387323
        DOI:10.1145/3477495
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. anti-vaccine concerns
        2. covid-19
        3. dataset
        4. explainable classification
        5. multi-label classification
        6. summarization
        7. tweets

        Qualifiers

        • Research-article

        Funding Sources

        • Accenture Corporation
        • DRDO Government of India

        Conference

        SIGIR '22
        Acceptance Rates

        Overall Acceptance Rate 792 of 3,983 submissions, 20%

        Article Metrics

        • Downloads (last 12 months): 59
        • Downloads (last 6 weeks): 5
        Reflects downloads up to 28 Feb 2025

        Cited By

        • (2025) PORTRAIT: A Hybrid Approach to Create Extractive Ground-truth Summary for Disaster Event. ACM Transactions on the Web 19:1 (1-36). DOI: 10.1145/3711908. Online publication date: 15-Feb-2025
        • (2025) ATSumm: Auxiliary information enhanced approach for abstractive disaster tweet summarization with sparse training data. Knowledge-Based Systems 311 (112969). DOI: 10.1016/j.knosys.2025.112969. Online publication date: Feb-2025
        • (2025) Enhancing multilabel classification for unbalanced COVID-19 vaccination hesitancy tweets using ensemble learning. Computers in Biology and Medicine 184 (109437). DOI: 10.1016/j.compbiomed.2024.109437. Online publication date: Jan-2025
        • (2024) “Double vaccinated, 5G boosted!”: Learning Attitudes towards COVID-19 Vaccination from Social Media. ACM Transactions on the Web 19:1 (1-24). DOI: 10.1145/3702654. Online publication date: 4-Dec-2024
        • (2024) MuLX-QA: Classifying Multi-Labels and Extracting Rationale Spans in Social Media Posts. ACM Transactions on the Web 18:3 (1-26). DOI: 10.1145/3653303. Online publication date: 6-May-2024
        • (2024) Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1073-1082). DOI: 10.1145/3626772.3657882. Online publication date: 10-Jul-2024
        • (2024) Development and validation of VaxConcerns: A taxonomy of vaccine concerns and misinformation with Crowdsource-Viability. Vaccine 42:10 (2672-2679). DOI: 10.1016/j.vaccine.2024.02.081. Online publication date: Apr-2024
        • (2024) ADSumm: annotated ground-truth summary datasets for disaster tweet summarization. Social Network Analysis and Mining 14:1. DOI: 10.1007/s13278-024-01323-9. Online publication date: 5-Aug-2024
        • (2024) Utilizing the Twitter social media to identify transportation-related grievances in Indian cities. Social Network Analysis and Mining 14:1. DOI: 10.1007/s13278-024-01278-x. Online publication date: 17-Jun-2024
        • (2024) ICPR 2024 Competition on Multilingual Claim-Span Identification. Pattern Recognition. Competitions (134-144). DOI: 10.1007/978-3-031-80139-6_10. Online publication date: 30-Nov-2024
