Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions

S.I.: LatinX in AI Research
Neural Computing and Applications

Abstract

Twitter has been a remarkable resource for pharmacovigilance research over the last decade. Traditionally, rule- or lexicon-based methods have been used to automatically extract drug-related tweets for human annotation. However, human annotation to create labeled sets for machine learning models is laborious, time-consuming, and not scalable. In this work, we demonstrate the feasibility of applying weak supervision (noisy labeling) to select drug data and of building machine learning models on large amounts of noisily labeled data instead of limited gold standard labeled sets. Our results show that models built with large amounts of noisy data achieve performance similar to models trained on limited gold standard datasets. Weak supervision thus reduces reliance on manual annotation, allowing more data to be labeled easily and used for downstream machine learning applications, in this case drug mention identification.
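
To make the approach concrete, below is a minimal sketch of lexicon-based weak supervision in Python: a labeling function assigns a noisy label to each tweet by lexicon lookup, and a classifier is then trained on those labels with scikit-learn [21]. The lexicon, example tweets, and logistic regression model are hypothetical stand-ins for illustration only; the study's actual pipeline works with curated drug lexicons and far larger Twitter datasets, and its classifiers include the transformer models cited in the references [3, 23, 24].

    # Minimal sketch of lexicon-based weak supervision (illustrative only:
    # the lexicon, tweets, and classifier are hypothetical stand-ins, not
    # the paper's actual pipeline or data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    DRUG_LEXICON = {"ibuprofen", "metformin", "adderall"}  # toy lexicon

    def weak_label(tweet: str) -> int:
        """Noisy labeling function: 1 if any lexicon term appears, else 0."""
        return int(bool(set(tweet.lower().split()) & DRUG_LEXICON))

    tweets = [
        "took ibuprofen for this headache",
        "metformin makes me so nauseous",
        "my adderall finally kicked in",
        "great game last night!",
        "coffee is my drug of choice",  # lexicon miss: weak labels are noisy
    ]
    labels = [weak_label(t) for t in tweets]  # [1, 1, 1, 0, 0], no annotators

    # Train a classifier on the weakly labeled set instead of a gold standard.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(tweets, labels)
    print(model.predict(["need my adderall refill"]))

The design assumption behind the paper's approach is that the labeling function is cheap and imperfect, and that the volume of noisy labels it produces at scale compensates for their imperfection, yielding models comparable to those trained on small manually annotated sets.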

Availability of data and material

The dataset used for training the models is available via Zenodo [17]. The trained models are available on request.

References

  1. Sarker A, Ginn R, Nikfarjam A et al (2015) Utilizing social media data for pharmacovigilance: A review. J Biomed Inform 54:202–212. https://doi.org/10.1016/j.jbi.2015.02.004

  2. Zhou Z-H (2017) A brief introduction to weakly supervised learning. Natl Sci Rev 5:44–53. https://doi.org/10.1093/nsr/nwx106

  3. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL]

  4. Cocos A, Fiks AG, Masino AJ (2017) Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts. J Am Med Inform Assoc 24:813–821. https://doi.org/10.1093/jamia/ocw180

  5. Nikfarjam A, Sarker A, O’Connor K et al (2015) Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 22:671–681. https://doi.org/10.1093/jamia/ocu041

  6. Ratner A, Varma P, Hancock B, et al (2019) Weak Supervision: A New Programming Paradigm for Machine Learning. http://ai.stanford.edu/blog/weak-supervision/. Accessed 23 Nov 2020

  7. Angluin D, Laird P (1988) Learning from noisy examples. Mach Learn 2:343–370. https://doi.org/10.1007/BF00116829

  8. Bishop CM (1995) Training with noise is equivalent to Tikhonov regularization. Neural Comput 7:108–116. https://doi.org/10.1162/neco.1995.7.1.108

  9. Simon HU (1996) General bounds on the number of examples needed for learning probabilistic concepts. J Comput System Sci 52:239–254. https://doi.org/10.1006/jcss.1996.0019

  10. Aslam JA, Decatur SE (1996) On the sample complexity of noise-tolerant learning. Inf Process Lett 57:189–195. https://doi.org/10.1016/0020-0190(96)00006-3

  11. Agarwal V, Podchiyska T, Banda JM et al (2016) Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc 23:1166–1173. https://doi.org/10.1093/jamia/ocw028

  12. Lindquist M (2007) The need for definitions in pharmacovigilance. Drug Saf 30:825–830. https://doi.org/10.2165/00002018-200730100-00001

  13. O’Connor K, Pimpalkhute P, Nikfarjam A et al (2014) Pharmacovigilance on twitter? Mining tweets for adverse drug reactions. AMIA Annu Symp Proc 2014:924–933

  14. Pain J, Levacher J, Quinquenel A, Belz A (2016) Analysis of Twitter data for postmarketing surveillance in pharmacovigilance. In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). pp 94–101

  15. Tekumalla R, Banda JM (2020) Characterizing drug mentions in COVID-19 Twitter chatter. In: Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.nlpcovid19-2.25

  16. Sarker A, Gonzalez G (2017) A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities. Data Brief 10:122–131. https://doi.org/10.1016/j.dib.2016.11.056

  17. Tekumalla R, Asl JR, Banda JM (2020) Mining Archive.org's Twitter Stream Grab for Pharmacovigilance Research Gold. In: Proceedings of the International AAAI Conference on Web and Social Media. pp 909–917

  18. Tekumalla R, Banda JM (2020) Social Media Mining Toolkit (SMMT). Genomics Inform 18:e16. https://doi.org/10.5808/GI.2020.18.2.e16

  19. Wayback Machine (2015) The Internet Archive. Searched for https://www.icann.org/icp/icp-1.html

  20. Klein A, Sarker A, Rouhizadeh M et al (2017) Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. BioNLP 2017:136–142

  21. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12:2825–2830

  22. Zhu Y, Kiros R, Zemel R, et al (2015) Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision. pp 19–27

  23. Lee J, Yoon W, Kim S et al (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240. https://doi.org/10.1093/bioinformatics/btz682

  24. Liu Y, Ott M, Goyal N, et al (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv [cs.CL]

  25. Rajapakse T (2020) Simple Transformers. Available at https://github.com/ThilinaRajapakse/simpletransformers

  26. Wolf T, Debut L, Sanh V, et al (2019) HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv e-prints. arXiv:1910.03771

  27. Hu B, Lu Z, Li H, Chen Q (2014) Convolutional Neural Network Architectures for Matching Natural Language Sentences. In: Ghahramani Z, Welling M, Cortes C, et al (eds) Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp 2042–2050

  28. Dos Santos C, Gatti M (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp 69–78

  29. Wang P, Xu J, Xu B, et al (2015) Semantic clustering and convolutional neural network for short text categorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp 352–357

  30. Kowsari K, Jafari Meimandi K, Heidarysafa M et al (2019) Text Classification Algorithms: A Survey. Information 10:150. https://doi.org/10.3390/info10040150

  31. Lavertu A, Altman RB (2019) RedMed: Extending drug lexicons for social media applications. J Biomed Inform 99:103307. https://doi.org/10.1016/j.jbi.2019.103307

  32. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

  33. Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp 1532–1543

  34. Godin F (2019) Improving and Interpreting Neural Networks for Word-Level Prediction Tasks in Natural Language Processing. PhD thesis, Ghent University, Belgium

Acknowledgements

We would like to thank Stephen Fleischman and HP Labs for providing us with an 8-GPU server to run our experiments during our research server's downtime.

Author information

Contributions

RT and JMB were involved in conceptualization, data curation, formal analysis, methodology and writing—review & editing.

Corresponding author

Correspondence to Juan M. Banda.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tekumalla, R., Banda, J.M. Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions. Neural Comput & Applic 35, 18161–18169 (2023). https://doi.org/10.1007/s00521-021-06614-2
