DOI: 10.1145/3485447.3512129
research-article

Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification

Published: 25 April 2022

Abstract

In user-generated text, such as posts on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts in which users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically studies health mentions only on Twitter, covers a limited set of disease or symptom terms, and ignores both user behavior information and the other ways people use disease or symptom terms. To advance HMC research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET, which hierarchically combines target keyword (disease or symptom term) identification and user behavior to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state-of-the-art) and show that our new dataset poses a strong challenge to existing HMC methods.
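
The four-way labeling scheme and an F1-style evaluation can be illustrated with a small sketch. This is not the authors' code or evaluation protocol: the abstract does not say how the reported F1-Score is averaged, so the macro average below is an assumption, and the names `HealthMention`, `per_class_f1`, and `macro_f1` are hypothetical.

```python
from collections import Counter
from enum import Enum


class HealthMention(Enum):
    """The four annotation labels named in the abstract."""
    PERSONAL = "personal health mention"
    NON_PERSONAL = "non-personal health mention"
    FIGURATIVE = "figurative health mention"
    HYPERBOLIC = "hyperbolic health mention"


def per_class_f1(gold, pred):
    """Return {label: F1} from two parallel lists of HealthMention labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted p, but it was not p
            fn[g] += 1          # missed an instance of g
    scores = {}
    for label in HealthMention:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)
```

Macro averaging weights each of the four labels equally, which matters when some labels (for example hyperbolic mentions) are rare; a micro or weighted average would give different numbers on the same predictions.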

          Information & Contributors

          Information

          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022
          3764 pages
          ISBN:9781450390965
          DOI:10.1145/3485447
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. Health Mention Classification
          2. Public Health Surveillance
          3. Reddit

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          WWW '22
          WWW '22: The ACM Web Conference 2022
          April 25 - 29, 2022
          Virtual Event, Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • Downloads (Last 12 months): 96
          • Downloads (Last 6 weeks): 11
          Reflects downloads up to 28 Feb 2025

          Citations

          Cited By

          • (2025) TepiSense: A Social Computing-Based Real-Time Epidemic Surveillance System Using Artificial Intelligence. IEEE Access 13 (23816-23832). DOI: 10.1109/ACCESS.2025.3537168. Online publication date: 2025
          • (2024) A Linguistic Grounding-Infused Contrastive Learning Approach for Health Mention Classification on Social Media. Proceedings of the 17th ACM International Conference on Web Search and Data Mining (529-537). DOI: 10.1145/3616855.3635763. Online publication date: 4-Mar-2024
          • (2024) Hybrid Text Representation for Explainable Suicide Risk Identification on Social Media. IEEE Transactions on Computational Social Systems 11:4 (4663-4672). DOI: 10.1109/TCSS.2022.3184984. Online publication date: Aug-2024
          • (2024) Enhancing Health Mention Classification Through Reexamining Misclassified Samples and Robust Fine-Tuning Pre-Trained Language Models. IEEE Access 12 (190445-190453). DOI: 10.1109/ACCESS.2024.3510388. Online publication date: 2024
          • (2024) Optimizing classification of diseases through language model analysis of symptoms. Scientific Reports 14:1. DOI: 10.1038/s41598-024-51615-5. Online publication date: 17-Jan-2024
          • (2023) RHMD: A Real-World Dataset for Health Mention Classification on Reddit. IEEE Transactions on Computational Social Systems 10:5 (2325-2334). DOI: 10.1109/TCSS.2022.3186883. Online publication date: Oct-2023
          • (2023) Robust Identification of Figurative Language in Personal Health Mentions on Twitter. IEEE Transactions on Artificial Intelligence 4:2 (362-372). DOI: 10.1109/TAI.2022.3175469. Online publication date: Apr-2023
          • (2023) Figurative Health-mention Classification from Social Media using Graph Convolutional Networks. 2023 9th International Conference on Smart Computing and Communications (ICSCC) (570-575). DOI: 10.1109/ICSCC59169.2023.10334990. Online publication date: 17-Aug-2023
          • (2023) Event Labeling Approach for Twitter Datasets Leveraging N-grams, Topics, and Machine Learning Algorithms for Enhanced Event Detection. 2023 4th International Conference on Communication, Computing and Industry 6.0 (C216) (1-6). DOI: 10.1109/C2I659362.2023.10430550. Online publication date: 15-Dec-2023
          • (2023) Understanding the Language of ADHD and Autism Communities on Social Media. 2023 IEEE International Conference on Big Data (BigData) (2188-2195). DOI: 10.1109/BigData59044.2023.10386833. Online publication date: 15-Dec-2023
