research-article

Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification

Authors:

Matloob Khushi,

Adam G. Dunn

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 2573 - 2581

https://doi.org/10.1145/3485447.3512129

Published: 25 April 2022

Abstract

In user-generated text such as social media posts and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts where users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically studies health mentions only on Twitter, covers a limited set of disease or symptom terms, and ignores both user behavior information and the other ways people use disease or symptom terms. To advance HMC research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET, which hierarchically combines target keyword (disease or symptom term) identification and user behavior to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state-of-the-art) and show that our new dataset poses a strong challenge to existing HMC methods.

Index Terms

Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification
1. Applied computing
  1. Life and medical sciences
    1. Health care information systems
    2. Health informatics
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN: 9781450390965

DOI: 10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
