Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions

S.I.: LatinX in AI Research
Neural Computing and Applications

Abstract

Twitter has been a remarkable resource for pharmacovigilance research over the last decade. Traditionally, rule- or lexicon-based methods have been used to automatically extract drug-related tweets for human annotation. However, human annotation to create labeled sets for machine learning models is laborious, time-consuming, and not scalable. In this work, we demonstrate the feasibility of applying weak supervision (noisy labeling) to select drug data and of building machine learning models on large amounts of noisily labeled data instead of limited gold standard labeled sets. Our results show that models built with large amounts of noisy data achieve performance similar to models trained on limited gold standard datasets. Weak supervision thus reduces reliance on manual annotation, allowing more data to be labeled easily and used for downstream machine learning applications, in this case drug mention identification.
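
To make the approach concrete, below is a minimal sketch of lexicon-based weak supervision in Python: a labeling function assigns a noisy label to each tweet by lexicon lookup, and a classifier is then trained on those labels with scikit-learn [21]. The lexicon, example tweets, and logistic regression model are hypothetical stand-ins for illustration only; the study's actual pipeline works with curated drug lexicons and far larger Twitter datasets, and its classifiers include the transformer models cited in the references [3, 23, 24].

    # Minimal sketch of lexicon-based weak supervision (illustrative only:
    # the lexicon, tweets, and classifier are hypothetical stand-ins, not
    # the paper's actual pipeline or data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    DRUG_LEXICON = {"ibuprofen", "metformin", "adderall"}  # toy lexicon

    def weak_label(tweet: str) -> int:
        """Noisy labeling function: 1 if any lexicon term appears, else 0."""
        return int(bool(set(tweet.lower().split()) & DRUG_LEXICON))

    tweets = [
        "took ibuprofen for this headache",
        "metformin makes me so nauseous",
        "my adderall finally kicked in",
        "great game last night!",
        "coffee is my drug of choice",  # lexicon miss: weak labels are noisy
    ]
    labels = [weak_label(t) for t in tweets]  # [1, 1, 1, 0, 0], no annotators

    # Train a classifier on the weakly labeled set instead of a gold standard.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(tweets, labels)
    print(model.predict(["need my adderall refill"]))

The design assumption behind the paper's approach is that the labeling function is cheap and imperfect, and that the volume of noisy labels it produces at scale compensates for their imperfection, yielding models comparable to those trained on small manually annotated sets.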

Availability of data and material

The dataset used for training the models is available via Zenodo [17]. The trained models are available on request.

References

  1. Sarker A, Ginn R, Nikfarjam A et al (2015) Utilizing social media data for pharmacovigilance: A review. J Biomed Inform 54:202–212. https://doi.org/10.1016/j.jbi.2015.02.004

  2. Zhou Z-H (2017) A brief introduction to weakly supervised learning. Natl Sci Rev 5:44–53. https://doi.org/10.1093/nsr/nwx106

  3. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL]

  4. Cocos A, Fiks AG, Masino AJ (2017) Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts. J Am Med Inform Assoc 24:813–821. https://doi.org/10.1093/jamia/ocw180

  5. Nikfarjam A, Sarker A, O’Connor K et al (2015) Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 22:671–681. https://doi.org/10.1093/jamia/ocu041

  6. Ratner A, Varma P, Hancock B, et al (2019) Weak Supervision: A New Programming Paradigm for Machine Learning. http://ai.stanford.edu/blog/weak-supervision/. Accessed 23 Nov 2020

  7. Angluin D, Laird P (1988) Learning from noisy examples. Mach Learn 2:343–370. https://doi.org/10.1007/BF00116829

  8. Bishop CM (1995) Training with noise is equivalent to Tikhonov regularization. Neural Comput 7:108–116. https://doi.org/10.1162/neco.1995.7.1.108

  9. Simon HU (1996) General bounds on the number of examples needed for learning probabilistic concepts. J Comput System Sci 52:239–254. https://doi.org/10.1006/jcss.1996.0019

  10. Aslam JA, Decatur SE (1996) On the sample complexity of noise-tolerant learning. Inf Process Lett 57:189–195. https://doi.org/10.1016/0020-0190(96)00006-3

  11. Agarwal V, Podchiyska T, Banda JM et al (2016) Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc 23:1166–1173. https://doi.org/10.1093/jamia/ocw028

  12. Lindquist M (2007) The need for definitions in pharmacovigilance. Drug Saf 30:825–830. https://doi.org/10.2165/00002018-200730100-00001

  13. O’Connor K, Pimpalkhute P, Nikfarjam A et al (2014) Pharmacovigilance on twitter? Mining tweets for adverse drug reactions. AMIA Annu Symp Proc 2014:924–933

  14. Pain J, Levacher J, Quinquenel A, Belz A (2016) Analysis of Twitter data for postmarketing surveillance in pharmacovigilance. In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). pp 94–101

  15. Tekumalla R, Banda JM (2020) Characterizing drug mentions in COVID-19 Twitter chatter. In: Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.nlpcovid19-2.25

  16. Sarker A, Gonzalez G (2017) A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities. Data Brief 10:122–131. https://doi.org/10.1016/j.dib.2016.11.056

  17. Tekumalla R, Asl JR, Banda JM (2020) Mining Archive.org's Twitter Stream Grab for Pharmacovigilance Research Gold. In: Proceedings of the International AAAI Conference on Web and Social Media. pp 909–917

  18. Tekumalla R, Banda JM (2020) Social Media Mining Toolkit (SMMT). Genomics Inform 18:e16. https://doi.org/10.5808/GI.2020.18.2.e16

  19. Wayback Machine (2015) The Internet Archive. Searched for https://www.icann.org/icp/icp-1.html

  20. Klein A, Sarker A, Rouhizadeh M et al (2017) Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. BioNLP 2017:136–142

  21. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12:2825–2830

  22. Zhu Y, Kiros R, Zemel R, et al (2015) Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision. pp 19–27

  23. Lee J, Yoon W, Kim S et al (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240. https://doi.org/10.1093/bioinformatics/btz682

  24. Liu Y, Ott M, Goyal N, et al (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv [cs.CL]

  25. Rajapakse T (2020) Simple Transformers. Available at https://github.com/ThilinaRajapakse/simpletransformers

  26. Wolf T, Debut L, Sanh V, et al (2019) HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv e-prints. arXiv:1910.03771

  27. Hu B, Lu Z, Li H, Chen Q (2014) Convolutional Neural Network Architectures for Matching Natural Language Sentences. In: Ghahramani Z, Welling M, Cortes C, et al (eds) Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp 2042–2050

  28. Dos Santos C, Gatti M (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp 69–78

  29. Wang P, Xu J, Xu B, et al (2015) Semantic clustering and convolutional neural network for short text categorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp 352–357

  30. Kowsari K, Jafari Meimandi K, Heidarysafa M et al (2019) Text Classification Algorithms: A Survey. Information 10:150. https://doi.org/10.3390/info10040150

  31. Lavertu A, Altman RB (2019) RedMed: Extending drug lexicons for social media applications. J Biomed Inform 99:103307. https://doi.org/10.1016/j.jbi.2019.103307

  32. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

  33. Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp 1532–1543

  34. Godin F (2019) Improving and Interpreting Neural Networks for Word-Level Prediction Tasks in Natural Language Processing. PhD thesis, Ghent University, Belgium

Acknowledgements

We would like to thank Stephen Fleischman and HP Labs for providing us with an 8-GPU server to run our experiments during our research server's downtime.

Author information

Contributions

RT and JMB were involved in conceptualization, data curation, formal analysis, methodology and writing—review & editing.

Corresponding author

Correspondence to Juan M. Banda.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tekumalla, R., Banda, J.M. Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions. Neural Comput & Applic 35, 18161–18169 (2023). https://doi.org/10.1007/s00521-021-06614-2
