DOI: 10.1145/3110025.3110110
research-article

Mining Frequency of Drug Side Effects over a Large Twitter Dataset Using Apache Spark

Published: 31 July 2017

Abstract

Despite clinical trials by pharmaceutical companies and current FDA reporting systems, some drug side effects still go undetected. One way to obtain a larger sample of reports is to mine online social media. With its widespread use, social media such as Twitter generates massive amounts of data that can serve as reports of drug side effects. Apache Spark has become popular for fast, distributed batch processing of such large datasets. In this work, we improve on previous sentiment analysis-based pipelines for mining, processing, and extracting tweets that report drug-caused side effects. We also add a new ensemble classifier that combines sentiment analysis features to increase the accuracy of identifying drug-caused side effects, and we provide a frequency count for the side effects. Furthermore, we implement the same pipeline in Apache Spark, which speeds up tweet processing by 2.5 times and supports the processing of large tweet datasets. Because the frequency count of drug side effects opens the door to further analysis, we present a preliminary study on this issue, including the side effects of using two drugs simultaneously and the potential danger of using less common drug combinations. We believe the pipeline design and the results presented in this work have significant implications for studying drug side effects and for big data analysis in general.
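
The frequency count described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is plain Python rather than Apache Spark, the drug and side-effect names are invented for illustration, and the input is assumed to be tweets that a classifier has already labeled as reporting a drug-caused side effect. The structure mirrors a map step that emits a count per (drug or drug pair, side effect) key and a reduce step that sums per key, which in Spark would roughly correspond to `flatMap` followed by `reduceByKey`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: (drugs mentioned, side effects mentioned) per tweet,
# assumed to have already passed the side-effect classifier.
classified_tweets = [
    ({"ibuprofen"}, {"nausea"}),
    ({"ibuprofen", "aspirin"}, {"nausea", "dizziness"}),
    ({"aspirin"}, {"dizziness"}),
    ({"ibuprofen"}, {"nausea"}),
]


def emit_pairs(drugs, effects):
    """Map step: yield ((drug_key, effect), 1) for single drugs and drug pairs."""
    ordered = sorted(drugs)
    keys = [(d,) for d in ordered]          # single-drug keys
    keys += list(combinations(ordered, 2))  # two-drug combination keys
    for key in keys:
        for effect in sorted(effects):
            yield (key, effect), 1


def count_side_effects(tweets):
    """Reduce step: sum the emitted counts per (drug_key, effect)."""
    counts = Counter()
    for drugs, effects in tweets:
        for key, one in emit_pairs(drugs, effects):
            counts[key] += one
    return counts


counts = count_side_effects(classified_tweets)

# Print the most frequent (drug or drug pair, side effect) entries.
for (drug_key, effect), n in counts.most_common(3):
    print(" + ".join(drug_key), effect, n)
```

Sorting the drug names before forming pairs gives each combination a single canonical key, so a tweet that mentions the same two drugs in a different order is counted under the same entry.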

      Published In

      ASONAM '17: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017
      July 2017
      698 pages
      ISBN:9781450349932
      DOI:10.1145/3110025
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Apache Spark
      2. Twitter
      3. adverse drug event
      4. classification
      5. machine learning
      6. natural language processing
      7. opinion mining
      8. sentiment analysis
      9. supervised learning

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ASONAM '17

      Acceptance Rates

      Overall Acceptance Rate 116 of 549 submissions, 21%

      Cited By

      • (2024) Using Social Media as a Source of Real-World Data for Pharmaceutical Drug Development and Regulatory Decision Making. Drug Safety 47:5 (495-511). DOI: 10.1007/s40264-024-01409-5. Online publication date: 6-Mar-2024
      • (2023) Exploration of sentiment analysis in twitter propaganda: a deep dive. Multimedia Tools and Applications 83:15 (44729-44751). DOI: 10.1007/s11042-023-17383-6. Online publication date: 19-Oct-2023
      • (2023) Machine Learning-Based Sentiment Analysis Approaches. Sentiment Analysis in the Medical Domain (71-78). DOI: 10.1007/978-3-031-30187-2_11. Online publication date: 24-Mar-2023
      • (2022) Large-scale digital forensic investigation for Windows registry on Apache Spark. PLOS ONE 17:12 (e0267411). DOI: 10.1371/journal.pone.0267411. Online publication date: 7-Dec-2022
      • (2020) Big data analytics meets social media: A systematic review of techniques, open issues, and future directions. Telematics and Informatics (101517). DOI: 10.1016/j.tele.2020.101517. Online publication date: Oct-2020
      • (2019) Exploratory data analysis and crime prediction for smart cities. Proceedings of the 23rd International Database Applications & Engineering Symposium (1-9). DOI: 10.1145/3331076.3331114. Online publication date: 10-Jun-2019
      • (2019) Harnessing social media data for pharmacovigilance: a review of current state of the art, challenges and future directions. International Journal of Data Science and Analytics. DOI: 10.1007/s41060-019-00175-3. Online publication date: 12-Feb-2019
      • (2018) Social Network Mining for Recommendation of Friends Based on Music Interests. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (833-840). DOI: 10.1109/ASONAM.2018.8508262. Online publication date: Aug-2018
      • (2017) On adverse drug event extractions using twitter sentiment analysis. Network Modeling Analysis in Health Informatics and Bioinformatics 6:1. DOI: 10.1007/s13721-017-0159-4. Online publication date: 18-Sep-2017
