1 Introduction

The unwelcome, undesirable or dangerous effects that a drug may have are referred to as adverse drug reactions (ADRs, or adverse drug effects). The term ‘side effect’, which is inexact, refers to an inadvertent, secondary effect that is observed along with the therapeutic function of the drug. Side effects may vary from person to person.

Adverse drug reactions can be considered a form of toxicity, arising, for example, from accidental or intentional overdose, other causes of elevated blood levels, or drug interactions. All drugs have the potential for adverse reactions; it is estimated that between 3 and 7% of all hospitalizations are due to adverse drug reactions.

A systematic review of 25 prospective observational studies demonstrated that 5.3% of all patients experience adverse drug reactions (Kongkaew et al. 2008). Thus, early detection of these events could greatly benefit human health. According to the Agency for Healthcare Research and Quality report, over 770,000 people are injured and/or die in hospitals annually due to adverse drug reactions (Classen et al. 1997). Alternative approaches are therefore required to detect the side effects of clinical medications. Economically, ADRs can considerably increase patients’ hospitalization costs (Bordet et al. 2001; Sultana et al. 2013).

Data from social media are a novel and rich source from which the trends and opinions of users about the side effects of drugs and about notable health events can be identified and managed. The purpose is to use these data to help patients.

Ask a patient (2001) is a website that empowers patients by allowing them to share and compare medication experiences; it was awarded the 2012 Webby Award for best website in the Pharmaceutical Category. The Ask a patient database contains more than 4000 chemically prepared prescription drugs approved by the FDA’s Center for Drug Evaluation and Research. Comments on prescription or over-the-counter drugs can be found using fine-tuned search criteria (age, gender, symptom, etc.). The difference between written and oral language in social media can create noise.

In addition, the lack of a suitable structure and the imbalance of data across drug groups are important challenges in classifying data from social media. So, despite the richness of health-related data in social media, little practical use is made of these data. The method used in the present research had three main phases. First, features were extracted from social media automatically through the learning process of deep learning. The comments by users of the website Ask a patient were processed to describe side effects and, accordingly, to reduce the difference between written and oral language as well as the noise.

Second, the efficiency of the deep learning method in classifying the data from the website Ask a patient was demonstrated. The results showed that deep learning achieves high accuracy at high speed. Third, the advantages and disadvantages of using comments to recognize side effects were assessed by comparing the side effects reported in the comments with those listed on two websites, Sider and WebMD. To this end, the deep learning method HAN (Yang et al. 2016) was employed to classify users’ comments. Then, to determine the specific topics in each group of drugs, the unsupervised topic modeling method NMF was employed.

2 Related work

Sarker and Gonzalez (2015) highlighted the importance of advanced NLP-based information generation for the accuracy of ADR sentence detection and classification, using traditional text classifiers such as Support Vector Machine, Naïve Bayes and Maximum Entropy.

Ginn et al. (2014) presented an annotated Twitter corpus focused on ADR mentions. They applied two supervised machine learning approaches (NB and SVM) to a broad range of annotated, medication-related ADR tweets. Although the classifiers showed moderate performance, the work was considered a foundation for the future development of advanced techniques. In line with this approach, Akhtyamova et al. (2017a) used a convolutional neural network (CNN) model with word2vec embeddings for the classification of Twitter comments. In contrast to Sarker’s model (reference), their proposed model not only uses a small fraction of features for gathering information, but also shows high performance and applicability in text classification.

Lee et al. (2017) proposed a semi-supervised CNN-based framework for adverse drug event (ADE) classification on Twitter. The Twitter dataset used in the PSB 2016 Social Media Shared Task was applied to evaluate the model, resulting in high-performance ADE classification (+9.9% F1-score). Notably, the ADE surveillance system required only a small number of labeled instances.

Akhtyamova et al. (2017b) presented a CNN-based architecture with numerous parameters that predicts whether a text reveals an adverse drug reaction based on the number of votes. To evaluate the performance of the model, a large-scale medical dataset derived from medical websites was used. In contrast to previously reported networks, the proposed end-to-end model does not require handcrafted features or data pre-processing, resulting in a substantial improvement over standard CNN-based methods.

This study aimed to investigate topic modeling of the comments written by typical users and to identify the changes in the reported comments over 10 years, so that the designed model can give researchers an immediate capability to analyze comments by combining deep learning methods. The trend of the comments over the years showed a significant change in how users report the side effects of drugs. The observed reduction may be attributed to the use of drug supplements, changes in lifestyle, genetic improvement of drugs, etc.

3 Methods

3.1 Workflow of the research

This paper is organized in two sections, Classification and Topic Modeling (Fig. 1).

Fig. 1 The workflow of the proposed deep learning-based strategy

4 Section 1: Classification

4.1 Data sources

Prior to collecting data, we selected a set of drugs of interest that were likely to have a large number of associated comments in the Ask a patient database. We selected drugs prescribed for chronic diseases and syndromes, for which large numbers of comments were expected, and drugs with a high prevalence of use. The names of the medications are reported by class (Anti-depressant Medicines, Anti-Pregnancy Medicines and Digestion Medicines) in Figs. 11, 12 and 13 in the Appendix.

4.2 Pre-processing

The comments in both datasets are pre-processed as follows:

  1. Data shuffling

  2. Converting all uppercase words to lowercase

  3. Elimination of special characters such as @, !, /, * and $

  4. Removal of stop words: at, of, the, …

  5. Correction of words with repeated characters, such as pleaseeeeeeeeee and yessss

  6. Conversion of contractions to their base form, e.g., I’m → I am

  7. Lemmatization: I started taking almost two months ago. → I start take almost two months ago.
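The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the stop-word list, the contraction table and the lemma table are tiny hypothetical stand-ins for the fuller resources a real pipeline would use, and contractions are expanded before special characters are removed (a reordering of steps 3 to 6) so that apostrophes are still present when they are needed.

```python
import random
import re

# Tiny illustrative resources (assumptions); a real pipeline would use fuller lists.
STOP_WORDS = {"at", "of", "the", "a", "an", "in", "on", "to"}
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is"}
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"started": "start", "taking": "take", "took": "take"}


def preprocess(comment):
    """Apply steps 2 to 7 of the list above to a single comment."""
    text = comment.lower()                                  # step 2: lowercase
    text = text.replace("\u2019", "'")                      # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():                # step 6: expand contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # step 3: drop special characters
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)            # step 5: collapse runs of 3+ letters
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # step 4: stop words
    tokens = [LEMMAS.get(t, t) for t in tokens]             # step 7: lemmatize
    return " ".join(tokens)


def preprocess_corpus(comments, seed=0):
    """Step 1 (shuffling) followed by per-comment cleaning."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)
    return [preprocess(c) for c in shuffled]
```

Restricting the repeated-character rule to letters keeps numeric strings such as dosages intact.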

4.3 Cross-validation

For many classification models, the complexity may be governed by multiple parameters. In order to achieve the best prediction performance on new data, we wish to find appropriate values of the complexity parameters that lead to the optimal model for a particular application.

If data are plentiful, a simple approach to model selection is to divide the entire dataset into three subsets: the training set, the validation set and the test set. A range of models is trained on the training set, compared and selected on the validation set, and finally evaluated on the test set.

Among the diverse complex models that have been trained, the one having the best predictive performance is selected, which is an effective model validated by the data in the validation set. In a practical application, however, the supply of data for training and testing is limited, leading to an increase in the generalization error. An approach to reducing the generalization error and preventing over-fitting is to use cross-validation. The distribution of data for each group is shown in Table 7 in Appendix.
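A minimal sketch of k-fold cross-validation is given below to make the idea concrete. The number of folds (k = 5), the seed, and the `fit` and `score` callables are assumptions made for illustration; the text does not state which values or routines were used.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(samples, labels, fit, score, k=5):
    """Average validation score over k folds.

    `fit(train_x, train_y)` returns a model and `score(model, val_x, val_y)`
    returns a number; both are hypothetical placeholders supplied by the caller.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score(model,
                            [samples[i] for i in val_idx],
                            [labels[i] for i in val_idx]))
    return sum(scores) / len(scores)
```

Because each sample is held out exactly once, the averaged score uses all of the limited data for both training and validation.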

4.4 Deep classification

The methods used for data classification are HAN (Yang et al. 2016) and FastText (Joulin et al. 2017), which share a similar word2vec component. Once the word2vec embeddings had been generated, the resulting file was used in the subsequent steps of the study.

4.4.1 HAN method

Hierarchical Attention Network (HAN) has two distinctive characteristics: (I) a hierarchical structure that mirrors the hierarchical structure of documents; and (II) two levels of attention mechanisms, applied at the word level and the sentence level, enabling it to attend differentially to more and less important content when constructing the document representation. Accordingly, the HAN network is composed of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer. HAN rests on the hypothesis that incorporating sentence and document structure into the model architecture yields a better document representation (Table 1).
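The two attention levels can be illustrated with a deliberately simplified sketch. It is not the HAN of Yang et al. (2016): the recurrent word and sentence encoders and the learned projection inside the attention layer are omitted, and the two context vectors, which are learned in the real model, are passed in as plain lists. It only shows how word vectors are pooled into sentence vectors and sentence vectors into a document vector with softmax attention weights.

```python
import math


def attention_pool(vectors, context):
    """Return the attention-weighted average of `vectors` and the weights.

    Each vector is scored by the dot product of tanh(vector) with `context`;
    the scores are turned into weights with a numerically stable softmax.
    """
    scores = [sum(math.tanh(x) * c for x, c in zip(v, context)) for v in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def document_vector(document, word_context, sentence_context):
    """`document` is a list of sentences, each a list of word vectors."""
    # Word-level attention: one vector per sentence.
    sentence_vectors = [attention_pool(words, word_context)[0] for words in document]
    # Sentence-level attention: one vector for the whole document.
    return attention_pool(sentence_vectors, sentence_context)
```

The returned sentence weights indicate which sentences contributed most to the document representation, which is what makes the hierarchy interpretable.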

Table 1 (HAN and FastText) training phase configuration

4.4.2 FastText method

FastText is a simple and efficient approach to text classification. Numerous studies show that it is often comparable to deep learning methods in terms of accuracy while being considerably faster in training and evaluation (Table 1). Architecturally, there are two major and influential differences:

  1. Hierarchical softmax: the softmax is organized as a hierarchy based on a Huffman coding tree, which reduces the time complexity from O(Kd) to O(d log K), where K is the number of targets and d is the dimension of the hidden layer.

  2. N-gram features: a bag of words is invariant to word order, but taking this order into account explicitly is often computationally very expensive. Instead, we used a bag of n-grams as additional features to capture some partial information about the local word order. This is very efficient in practice while achieving results comparable to those of methods that explicitly use the order.
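The effect of the n-gram features can be seen in a short sketch. The function names are illustrative and the code does not reproduce FastText's internal feature handling; it only shows that two comments with the same words in a different order share their unigram features but differ in their bigram features.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous word n-grams of length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count unigrams up to max_n-grams in a whitespace-tokenized text."""
    tokens = text.split()
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(tokens, n))
    return features


a = bag_of_ngrams("drug caused headache")
b = bag_of_ngrams("headache caused drug")
# Unigram counts are identical, so a plain bag of words cannot tell a from b;
# the bigrams ("drug caused" vs "caused drug") recover part of the word order.
```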

4.5 Evaluation metrics

Precision (positive predictive value) and recall (sensitivity): precision is the fraction of retrieved instances that are relevant, and recall is the fraction of relevant instances that are retrieved. Applying these metrics depends on an understanding and a measure of relevance.

Accuracy: the accuracy of the classification of group x measured against all items for which the label x is suggested by the classifier. This criterion indicates how trustworthy the classification output is.

F-measure: this criterion combines recall and precision, and it is used when neither of the two criteria can be given special importance.

Kappa: this criterion is often used to test inter-rater reliability and to assess the accuracy of the system while accounting for how much of the generated output is due to chance (Table 2).

Table 2 Evaluation metrics formula
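As an illustration, the metrics can be computed from a confusion matrix as in the sketch below. It uses the standard textbook definitions (overall accuracy, per-class precision, recall and F1, and Cohen's kappa), which are assumed here and may differ in detail from the formulas in Table 2; the example matrix is made up.

```python
def evaluate(confusion):
    """confusion[i][j] = number of samples of actual class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]                        # actual counts
    col_sums = [sum(confusion[i][j] for i in range(n_classes))        # predicted counts
                for j in range(n_classes)]
    per_class = []
    for c in range(n_classes):
        tp = confusion[c][c]
        precision = tp / col_sums[c] if col_sums[c] else 0.0
        recall = tp / row_sums[c] if row_sums[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    # Observed agreement (accuracy) and the agreement expected by chance.
    observed = sum(confusion[c][c] for c in range(n_classes)) / total
    expected = sum(row_sums[c] * col_sums[c] for c in range(n_classes)) / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": observed, "kappa": kappa, "per_class": per_class}


# Made-up two-class example: rows are actual classes, columns are predictions.
result = evaluate([[8, 2], [1, 9]])
```

Kappa subtracts the agreement expected by chance from the observed accuracy, which is why it is lower than the accuracy whenever some agreement could be coincidental.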

5 Section 2: Topic Modeling

5.1 Input datasets in the second phase

Comments on the three classes of drugs from the period between 2008 and 2018 are used, based on the drugs presented in Figs. 11, 12 and 13.

5.2 Topic modeling

As a linear-algebraic model, non-negative matrix factorization (NMF) maps high-dimensional vectors into a low-dimensional representation. Unlike principal component analysis (PCA), NMF exploits the fact that the vectors are non-negative. By mapping the vectors into a lower-dimensional form, NMF constrains the coefficients to be non-negative as well.

Given the original matrix A, two matrices W and H can be obtained such that A ≈ WH. NMF has an inherent clustering property, and A, W and H represent the following information:

A (Document-Word Matrix): input that shows which words appear in which documents.

W (Basis Vectors): the topics (clusters) elicited from the documents.

H (Coefficient Matrix): the membership weights for the topics in each document.

W and H are calculated by optimizing an objective function iteratively (in a manner similar to the EM algorithm), updating both W and H until they converge. The NMF topic modeling configuration is provided in Table 3.
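The alternating updates can be sketched as follows. The sketch assumes the Frobenius-norm objective with multiplicative update rules, which is one common choice; the text does not say which objective or solver was used. It also assumes that the rows of A are terms and the columns are documents, so that the columns of W are topics and the columns of H hold the topic weights of each document, matching the description above. The small matrix is made up.

```python
import random


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def nmf(A, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative A (n_rows x n_cols) as W (n_rows x k) times H (k x n_cols)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_rows)]
    H = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(n_topics)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_rows)]
    return W, H


def frobenius_error(A, W, H):
    """Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


# Made-up term-document matrix with two obvious groups of terms and documents.
A = [[2, 2, 0, 0], [1, 1, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
W, H = nmf(A, n_topics=2)
```

Because every update multiplies non-negative numbers, W and H stay non-negative throughout without any explicit constraint handling.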

Table 3 Topic modeling configuration

6 Results

6.1 Usage model

In this research, we used users’ comments from Ask a patient to extract the side effects of drugs. In the deep learning part, the following issues were considered in the training phase.

In general, the size of the window that moves over the texts in both the FastText and HAN methods is called Pad_Seq_Len; we set it to 150 because the maximum size of comments is generally 150, and the length of the sentences and the semantic connections between them are important. Moreover, the value of Embedding dim was 100 (Table 4). We evaluated several optimizers, namely Stochastic Gradient Descent, RMSprop and Adam, of which Adam gave the best results.
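A fixed sequence length is usually enforced by truncating long comments and padding short ones, as in the sketch below. The value 150 is taken from the text; the padding index 0, padding on the right and truncation at the end are assumptions, since the text does not specify them.

```python
PAD_SEQ_LEN = 150  # value reported in the text
PAD_ID = 0         # assumed index reserved for padding


def pad_sequence(token_ids, max_len=PAD_SEQ_LEN, pad_id=PAD_ID):
    """Truncate a sequence of token ids to max_len, then right-pad it with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))
```

For example, `pad_sequence([5, 6], max_len=4)` yields `[5, 6, 0, 0]`, so every comment reaches the network with the same length.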

Table 4 The hyper parameters in training phase

In Section 2, two parameters were set to extract critical topics: ngram_range, which defines the range of n-gram sizes to be detected, and min_df, which keeps only the terms that appear in at least a minimum number of documents.

The value of ngram_range was chosen based on the side-effect expressions extracted from the Sider and WebMD websites; other values such as (1,2), (2,3) and (3,3) were also tested, but (2,2) was the best choice (Table 5).
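The roles of the two parameters can be illustrated with a small sketch that builds a vocabulary of n-grams and filters it by document frequency. The parameter names follow the text, but the function itself, the default min_df of 2 and the example comments are hypothetical; the value actually used is the one given in Table 5.

```python
from collections import Counter


def ngram_vocabulary(documents, ngram_range=(2, 2), min_df=2):
    """Return the n-grams that occur in at least min_df documents."""
    low, high = ngram_range
    doc_freq = Counter()
    for doc in documents:
        tokens = doc.split()
        grams = set()
        for n in range(low, high + 1):
            grams.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        doc_freq.update(grams)  # each n-gram is counted once per document
    return sorted(g for g, df in doc_freq.items() if df >= min_df)


# Made-up, already pre-processed comments.
docs = ["weight gain nausea", "weight gain headache", "dry mouth"]
vocab = ngram_vocabulary(docs)
```

With ngram_range set to (2,2), only two-word expressions such as "weight gain" survive, which suits side effects that are typically named by two words.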

Table 5 NMF topic modeling parameters

6.2 Implementation model in section 1

In this research, the hardware used included an NVIDIA GeForce GTX 1050 GPU and an Intel Core i7 CPU. Two classification methods were applied to three different data groups. For each method, the learning rate and batch size were varied, and different criteria were tested for each type of model according to the type of data; the best results are shown in Table 6. For example, applying the HAN method with a batch size of 128 and a learning rate of 0.001 to the Ask a patient dataset resulted in the highest accuracy (0.924). The confusion matrices of HAN and FastText for the best results are reported in Tables 8 and 9 in the Appendix.

Table 6 Best result of deep learning classification methods on dataset

6.3 Implementation model in section 2

Based on the output of the previous phase, three features were used, namely side effect, reason and drug. Accordingly, in each class of drugs (neurotic medicines, anti-pregnancy and digestion), the 10 highest-priority topics were selected. As shown in Tables 10, 11 and 12, the topics of each class are verbally similar. After these tables had been extracted, similar topics were mapped to a common word, and meaningless topics were deleted. Figures 2, 4 and 6 show the frequency of the topics, and Figs. 3, 5 and 7 show the dispersion of the topics on the website Ask a patient during the years 2008 to 2018. The users’ comments about side effects show a different pattern in each year.

Fig. 2 Anti-depressant topic modeling visualization (anti-depressant topic modeling (Ask a patient) is reported in Table 10 in Appendix)

Fig. 3 Scatterplot of anti-depressant medicines topics on the website of Ask a patient based on year

Fig. 4 Anti-pregnancy topic modeling visualization (anti-pregnancy topic modeling (Ask a patient) is reported in Table 11 in Appendix)

Fig. 5 Scatterplot of anti-pregnancy medicines topics on the website of Ask a patient based on year

Fig. 6 Digestion topic modeling visualization (digestion topic modeling (Ask a patient) is reported in Table 12 in Appendix)

Fig. 7 Scatterplot of digestion medicines topics on the website of Ask a patient based on year

According to Figs. 8, 9 and 10, for all three classes of drugs, the side effects in users’ comments differed from those reported on the websites Sider and WebMD; the websites did report some of these side effects, but with a low frequency. The blue diagram shows the frequency of the side effects reported on the websites, and the red diagram shows the side effects in the comments by typical users. Some topics nevertheless overlapped between the users’ comments and the websites (Sider and WebMD).

Fig. 8 Comparison of topic modeling of users’ comments with the side effects reported on the websites of Sider and WebMD (Neurotic drugs)

Fig. 9 Comparison of topic modeling of users’ comments with the side effects reported on the websites of Sider and WebMD (Anti-pregnancy drugs)

Fig. 10 Comparison of topic modeling of users’ comments with the side effects reported on the websites of Sider and WebMD (Digestion drugs)

7 Discussion

In the present research, the deep learning methods HAN and FastText were employed to classify the side effects of three classes of drugs, namely neurotic, anti-pregnancy and digestion drugs. These three classes were investigated because comments on them had a high frequency. In the first phase, the data extracted from the website Ask a patient were entered into the model. Then, in the pre-processing phase, special characters, signs and stop words were removed and the characters were converted to lowercase in order to clean the text. In the second phase, three fields were investigated: the drug, the side effect and the reason for the side effect. These data were then passed to the topic modeling phase to extract the 10 highest-priority topics from each of the three groups of drugs. The outputs show that the frequency of the side effects reported in the comments from Ask a patient differed from that of the side effects reported by Sider and WebMD, with similar frequencies in only a few cases. Finally, the output of the proposed model was compared with the listed side effects of the drugs, and the analysis of the users’ reports and the sites’ reports is illustrated. In addition, the trends in users’ comments about side effects were analyzed over the 10 years from 2008 to 2018. The users’ comments clearly changed gradually over this period.

Finally, the results of the preliminary analysis of drug classification were presented in confusion matrices and interpreted using the accuracy rate and the false positive ratio.

In this work, we used a simple method for text classification with deep learning models. In contrast to word vectors trained in an unsupervised manner with word2vec, our word features can be averaged together to generate appropriate sentence representations.

In comparison with recent deep learning-based methods, FastText and HAN were much faster at text classification. Theoretically, deep neural networks have higher representational power than shallow models, but it is not clear whether simple text classification problems benefit from it.

Using the proposed consecutive deep learning framework, the pre-processing and classification of side effects were carried out for the three classes of drugs.

Additionally, in contrast to previous studies, we proposed an end-to-end solution based on deep learning models that does not require any handcrafted features or data pre-processing. Our experimental findings show that each model significantly outperforms the baseline methods on the different datasets.

8 Conclusions

Users’ comments identifying the side effects of drugs on the website Ask a patient were investigated, and a combined classification was extracted based on the three types of diseases that were most commented on. Analysis of the data using deep learning showed that users’ comments on the side effects of drugs were biased, as commenting was voluntary and the comments were not subject to evaluation. The comments were grouped using topic modeling, which yielded some reports similar to those issued by Sider and WebMD; however, the reports differed in frequency. For example, some side effects had been reported with a high frequency in Sider and WebMD, while typical users did not report them very often. On the other hand, some side effects not reported by Sider and WebMD had attracted typical users’ attention. Our findings enable the efficient use of vast batch sizes, significantly reducing the number of parameter updates required to train a model. This has the potential to dramatically reduce model training times. To sum up, using data from social media opens a wide and novel window in the field of drug studies. The results show that data from social media may be noisy or unreliable. Accordingly, social media can be employed as a secondary source for identifying the side effects of drugs rather than as a substitute for traditional, scientific methods of identifying side effects. The option of reporting an ‘unregistered side effect’ shows that a great deal of data that have not been reported in drug studies can be extracted from social media. Such side effects may have appeared because of the dosage or the way the drug was used, or because of interference from other drugs.
The model proposed in this study can be used for the immediate identification of pharmacological events, which most probably leads to an immediate reaction to, and timely discovery of, these events.