1 Introduction

Many hospitals use Electronic Health Record (EHR) systems to integrate different kinds of medical data [1, 2]. Data mining is widely applied to EHRs to support clinical decisions, including drug supervision [3], complication identification [4], disease prediction [5], patient stratification [6] and other related directions [7]. Among them, complication identification receives special attention. Complications are diseases that arise during the course of another disease. It is difficult to discover the potential relationship between a disease and its complications from a small dataset. The emergence of big data analytics makes it feasible to identify possible complications by mining the massive data in EHRs.

Previous studies on complication identification normally focused on a specific disease and adopted data mining algorithms to find potential complications of that disease. However, these studies rarely consider the complications of different diseases together, and they ignore the features of individual diseases. To mine potential complication relationships in EHRs, this study first proposes the concurrent weight of a disease, which describes the possibility that the disease will be a complication of other diseases. Based on the concurrent weight, we adopt a weighted association rule mining algorithm to mine potential complication relationships and predict possible complications for clinical decision-making.

The rest of the paper is organized as follows. Section 2 introduces the related works on complication identification. Then, the proposed complication mining method is presented in Sect. 3, followed by the experiment results in Sect. 4. Finally, we discuss major research findings and future research of this study in Sect. 5.

2 Related Works

Complications represent the concurrence and association relationships among different diseases. They normally affect patients seriously and lead to much higher medical expenditure. To identify the possible complications of a disease, various data mining techniques have been applied.

Roque et al. [8] use text mining to analyze the electronic patient records of a Danish psychiatric hospital. By extracting phenotype information from the records, they analyze disease co-occurrence relationships. Hanauer et al. [9] use the Molecular Concept Map (MCM) algorithm to mine an electronic medical database and find some interesting new disease associations. Holmes et al. [10] use the Application for Discovering Disease Associations using Multiple Sources (ADAMS) to identify the co-morbidities of the rare diseases Kaposi sarcoma, toxoplasmosis, and Kawasaki disease. By incorporating textual information from PubMed and Wikipedia, they find some rare or previously unreported associations.

As one of the most popular data mining algorithms, Association Rule Mining (ARM) is widely used to identify the complications of a particular disease. Tai and Chiu [11] focus on the complications of Attention Deficit/Hyperactivity Disorder (ADHD). By employing association rule mining, they find that the ADHD case group has an apparently higher risk of psychiatric comorbidity than of other physical illnesses. Kim et al. [12] analyze the complications of type 2 diabetes mellitus. Based on the medical data of 411,414 patients from 1996 to 2007, they develop the Dx Analyze tool to clean the data and reveal comorbidity associations. Their results show that association rule mining is practical for complication studies. Shin et al. [13] use the Apriori algorithm and the Clementine program to analyze the data of 5,022 patients with essential hypertension, and mine strong associations between hypertension, non-insulin-dependent diabetes mellitus and cerebral infarction. Moreover, based on a large amount of data, Wright et al. [14] use association rule mining to find associations among different diseases and laboratory results. Their results show that association rule mining is a useful tool for identifying clinically accurate associations and performs better than other knowledge-based methods.

In summary, prior studies have made great efforts to identify complication relationships with association rule mining. However, they do not consider the different roles that different diseases play in complication identification, although doing so can improve mining performance. In addition, patients’ medical history is critical information for disease diagnosis and should be incorporated when mining complication relationships.

3 Methods

In this study, we first define the concurrent weight to evaluate the possibility that a disease develops as a complication of other diseases. We then adopt a Back Propagation (BP) neural network to derive the concurrent weights of diseases, and identify complication relationships with a weighted association rule mining algorithm. Finally, a list of possible complications is recommended to support clinical decision-making.

3.1 The Concurrent Weight of Diseases

Prior medical experience is very helpful for predicting future health status and possible diseases. We assume that some diseases appear as complications more frequently than others. A concurrent weight is proposed to represent the possibility that a disease becomes a complication of other diseases. The higher a disease’s concurrent weight, the more likely it is to be a complication of some other disease.

Assume that the disease set is D, and every item in D is one disease. Each disease d_i has a corresponding set C_i that describes its known complications. The concurrent weight of the disease d_i is then defined as follows:

$$ W\left( {d_{i} } \right) = \frac{{\mathop \sum \nolimits_{j = 1}^{n} m_{ij} }}{n} $$
(1)

where W(d_i) is the concurrent weight of the disease d_i, n is the total number of diseases in the set D, and m_ij indicates whether the disease d_i appears in the complication set of d_j. When d_i appears in C_j, m_ij is equal to 1; otherwise it is 0.
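Eq. (1) can be illustrated with a minimal sketch. The function name and the toy complication sets below are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
def concurrent_weights(complications):
    """Compute W(d_i) = (sum_j m_ij) / n for every disease in D.

    `complications` maps each disease d_j to its known complication set C_j.
    """
    diseases = list(complications)  # the disease set D
    n = len(diseases)
    weights = {}
    for d_i in diseases:
        # m_ij is 1 when d_i appears in the complication set C_j, else 0
        count = sum(1 for d_j in diseases if d_i in complications[d_j])
        weights[d_i] = count / n
    return weights


# Toy data, invented for illustration only
toy = {
    "diabetes": {"retinopathy", "nephropathy"},
    "hypertension": {"nephropathy"},
    "retinopathy": set(),
    "nephropathy": set(),
}
weights = concurrent_weights(toy)
print(weights["nephropathy"])  # 0.5
```

Here nephropathy appears in two of the four complication sets, so its weight is 2/4, while a disease that is never listed as a complication gets weight 0.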

However, it is impossible to generate a complete complication list for each disease, so we cannot directly obtain concurrent weights for all diseases from prior knowledge. We therefore first collect the complications of some common diseases, then train an artificial intelligence algorithm on the collected data to predict the concurrent weights of the whole disease set. For the initial set of diseases and their complications, we used a web crawler to download disease information from the domestic professional medical website “Clove Garden” (http://www.dxy.cn/). In this way, 596 kinds of diseases and their occurrence frequencies as complications are collected as the known complication knowledge.

3.2 BP Neural Network Model

The Back Propagation (BP) neural network is one of the most widely used feedforward neural networks [15]. We selected a three-layer BP neural network, consisting of an input layer, a hidden layer and an output layer, to predict the concurrent weights of the diseases.

Input Layer of the Model.

First of all, we describe each disease as a three-dimensional vector that characterizes the properties of comorbid diseases. The three dimensions are the position of the disease in the International Classification of Diseases (ICD) coding schema, the importance of the disease, and its order of appearance in the diagnosis.

The position of the disease in the ICD coding schema represents the location of the disease in the ICD list. The importance of the disease represents its impact on the patient’s recovery process. For inpatients, the hospital normally records all diseases they have, and each disease is assigned an importance value that describes how important it is for the patient’s treatment. During a hospital stay, a patient may have several diseases, which are recorded in their order of appearance. By converting each disease to a vector, we used the 596 kinds of diseases and their weights as the training input of the BP neural network model. Each dimension of the disease vector is normalized so that every value lies in the range from 0 to 1.
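The text does not state which normalization is applied. The sketch below assumes min-max scaling per dimension, which is one common way to map values into the range from 0 to 1. The function name and the raw values are hypothetical.

```python
def min_max_normalize(vectors):
    """Scale each dimension of the vectors to [0, 1] independently."""
    dims = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(dims)]
    highs = [max(v[k] for v in vectors) for k in range(dims)]
    normalized = []
    for v in vectors:
        row = []
        for k in range(dims):
            span = highs[k] - lows[k]
            # A constant dimension carries no information; map it to 0.0
            row.append((v[k] - lows[k]) / span if span else 0.0)
        normalized.append(tuple(row))
    return normalized


# (ICD position, importance, appearing order): hypothetical raw values
raw = [(120, 1, 1), (4500, 3, 2), (9000, 5, 4)]
scaled = min_max_normalize(raw)
```

With this scaling, the smallest raw value in each dimension maps to 0 and the largest to 1, so the three dimensions become comparable in magnitude before they reach the input layer.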

Output Layer of the Model.

The output layer represents the learning result of the BP neural network model. In this study, the BP model is adopted to predict the concurrent weight of a disease, so the output layer encodes that weight.

We first discretize the concurrent weights of the diseases. Discretization reduces the influence of hidden defects in the training dataset and makes the model more stable. Moreover, because the number of diseases is large, the concurrent weights are commonly very small, so we amplify them to obtain significant results. The discretization formula for the concurrent weights is given as follows:

$$ f\left( w \right) = \left\{ {\begin{array}{*{20}c} {0.2,} & {\quad \quad \quad w \le 0.022} \\ {0.4,} & {0.022 < w \le 0.044} \\ {0.6,} & {0.044 < w \le 0.066} \\ {0.8, } & {\quad \quad \quad w > 0.066} \\ \end{array} } \right. $$
(2)

where f(w) is the discretization result of the concurrent weight w of a disease.

The number of neurons in the output layer is then set to 2, and the output value of each neuron is 0 or 1. The four possible outputs of the two neurons, (0,0), (0,1), (1,0) and (1,1), are mapped to the four weight levels 0.2, 0.4, 0.6 and 0.8, respectively. This establishes the correspondence between the input vector of a disease and its discretized concurrent weight.
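The discretization of Eq. (2) and the two-neuron encoding can be sketched as follows. The function names are hypothetical, and the 0.5 threshold used to binarize real-valued neuron outputs is an assumption, since the text does not specify how outputs are rounded to 0 or 1.

```python
def discretize(w):
    """Eq. (2): map a raw concurrent weight to one of four levels."""
    if w <= 0.022:
        return 0.2
    if w <= 0.044:
        return 0.4
    if w <= 0.066:
        return 0.6
    return 0.8


# Two binary output neurons encode the four weight levels
LEVEL_TO_BITS = {0.2: (0, 0), 0.4: (0, 1), 0.6: (1, 0), 0.8: (1, 1)}
BITS_TO_LEVEL = {bits: level for level, bits in LEVEL_TO_BITS.items()}


def encode(w):
    """Training target for a disease with raw concurrent weight w."""
    return LEVEL_TO_BITS[discretize(w)]


def decode(outputs, threshold=0.5):
    """Map two neuron outputs back to a weight level (threshold is assumed)."""
    bits = tuple(1 if o >= threshold else 0 for o in outputs)
    return BITS_TO_LEVEL[bits]
```

For instance, `encode(0.03)` yields the target `(0, 1)`, and `decode((0.9, 0.2))` reads the network output back as the level 0.6.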

Hidden Layer of the Model.

The hidden layer of the BP model is responsible for information transformation. A BP network can have one or several hidden layers. Because a BP neural network with a single hidden layer can approximate any nonlinear function with high precision [16], only one hidden layer is used in this study. The numbers of neurons in the input and output layers are determined by the input and output data. For the number of nodes in the hidden layer, many approaches have been proposed, but none works efficiently for all problems. The most common method is to determine an appropriate number of hidden nodes by comparing experimental performance. We therefore run experiments with a set of candidate values for the number of hidden neurons, and the value that yields the shortest training time is chosen as the final number of hidden nodes.
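The selection procedure described above can be sketched as a simple timing loop. The `train` callable is a hypothetical stand-in for whatever routine trains the network with a given hidden-layer size; the sketch does not implement back propagation itself.

```python
import time


def select_hidden_size(candidates, train, timer=time.perf_counter):
    """Return the candidate hidden-layer size with the shortest training time.

    `train(size)` is a hypothetical callable that trains a network with
    `size` hidden neurons; `timer` is injectable so the loop can be tested.
    """
    timings = {}
    for size in candidates:
        start = timer()
        train(size)
        timings[size] = timer() - start
    best = min(timings, key=timings.get)
    return best, timings
```

Training time is used as the criterion because that is what the text describes; a validation-error criterion would fit into the same loop by replacing the timed quantity.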

3.3 The Weighted Association Rule Mining

Differences among diseases are significant, and different diseases play different roles in identifying complications. The weighted association rule mining method [17] is therefore adopted in this study. It assigns each item a weight that is not based solely on item support, and it defines thresholds on weighted support and confidence to measure the significance of the mined association rules.

As in the traditional association rule mining algorithm, the support of the item set X is denoted as support(X). If the number of items in X is n and w_j is the weight of the j-th item in X, the weighted support of X is:

$$ {\text{Wsupport}}\left( {\text{X}} \right) = {\text{support}}\left( {\text{X}} \right) \times \left( {\frac{1}{n} \times \sum\nolimits_{j = 1}^{n} {w_{j} } } \right) $$
(3)

The item set X is weighted frequent if the weighted support of X is greater than a predefined minimum weighted support threshold (wminsup):

$$ {\text{Wsupport}}\left( {\text{X}} \right) \ge wminsup $$
(4)
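Eqs. (3) and (4) can be illustrated with a short sketch. The transactions and weights below are toy values invented for illustration, and the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def weighted_support(itemset, transactions, weights):
    """Eq. (3): support(X) times the mean weight of the items in X."""
    itemset = frozenset(itemset)
    mean_weight = sum(weights[i] for i in itemset) / len(itemset)
    return support(itemset, transactions) * mean_weight


def is_weighted_frequent(itemset, transactions, weights, wminsup):
    """Eq. (4): X is weighted frequent when Wsupport(X) >= wminsup."""
    return weighted_support(itemset, transactions, weights) >= wminsup


# Toy data: each transaction is the set of diseases in one patient record
transactions = [frozenset("AB"), frozenset("ABC"), frozenset("AC"), frozenset("B")]
weights = {"A": 0.2, "B": 0.4, "C": 0.8}
ws = weighted_support({"A", "B"}, transactions, weights)
```

In this toy case {A, B} occurs in 2 of 4 transactions and has mean weight 0.3, so its weighted support is about 0.15: it is weighted frequent for wminsup = 0.1 but not for wminsup = 0.2.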

The weighted support of a rule X → Y can be defined as:

$$ {\text{Wsupport}}\left( {{\text{X}} \to {\text{Y}}} \right) = {\text{support}}\left( {{\text{X}} \to {\text{Y}}} \right) \times \left( {\frac{1}{m} \times \mathop \sum \limits_{{i_{j} { \in }(X \cup Y),j = 1}}^{m} w_{j} } \right) $$
(5)

in which m is the total number of items in the set of (X ∪ Y).

The weighted association rule mining algorithm will retrieve all rules X → Y, where X ∪ Y is weighted frequent and whose confidence is greater than or equal to a minimum confidence threshold [18].
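The retrieval step can be sketched with a brute-force enumeration. This is not the FP-tree-based procedure used in the study; exhaustive enumeration is swapped in here only because it is short and easy to check, and it is practical only for tiny item sets. The function names, the `max_size` cap and the toy data are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def mine_weighted_rules(transactions, weights, wminsup, minconf, max_size=3):
    """Return (X, Y, Wsupport, confidence) for every qualifying rule X -> Y."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, transactions)
            if sup == 0:
                continue  # never observed together, so no rule is possible
            wsup = sup * sum(weights[i] for i in combo) / size  # Eq. (5)
            if wsup < wminsup:
                continue  # X ∪ Y is not weighted frequent
            # Split the itemset into every antecedent/consequent pair
            for k in range(1, size):
                for ante in combinations(combo, k):
                    x = frozenset(ante)
                    conf = sup / support(x, transactions)
                    if conf >= minconf:
                        rules.append((x, itemset - x, wsup, conf))
    return rules


# Toy data, invented for illustration only
transactions = [frozenset("AB"), frozenset("ABC"), frozenset("AC"), frozenset("B")]
weights = {"A": 0.2, "B": 0.4, "C": 0.8}
rules = mine_weighted_rules(transactions, weights, wminsup=0.1, minconf=0.6)
```

The sketch tests every candidate itemset rather than pruning by subsets, because a mean-weighted support does not in general shrink monotonically as items are added, so the usual Apriori-style pruning cannot be assumed to hold.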

To improve the efficiency of the algorithm, we adopt the frequent pattern (FP) tree structure to optimize weighted association rule mining [19]. First, the transaction database is scanned and, given the minimum weighted support threshold (wminsup), the weighted FP-tree is constructed from the weighted potential frequent 1-itemsets. Then the list of potential rules is mined with the weighted association mining approach.

3.4 Complication Prediction

The mined complication association rules provide valuable information for patient diagnosis. Based on a patient’s medical history, we can predict the patient’s possible complications by applying the mined rules.

When a patient makes a new visit to the hospital and the doctor identifies his/her disease, the patient’s main diseases in the several latest visits are taken as a disease set, which captures the patient’s medical history. The antecedents of the mined complication association rules are then searched for rules whose antecedent contains all diseases in the set. If some rules match, their consequents are displayed as the possible complications. If no rule has an antecedent that contains all diseases in the set, the oldest diagnosis is excluded: if the original set has n diseases, it has n − 1 diseases after the exclusion. The antecedents are then searched again for matching rules. These steps are repeated until some complication rules match or the set is empty. If the set is empty, no prediction is given. Because only a few rules include more than 6 diseases, we consider the patient’s 6 latest visits in the first step. For the matched rules, the possible complications are listed in order of the confidence of the complication association rules.
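The back-off procedure above can be sketched as follows. The rule tuples, the disease names and the confidence values are invented for illustration, and listing complications from highest to lowest confidence is an assumption about the intended ordering.

```python
def predict_complications(history, rules, max_visits=6):
    """Predict complications from a patient's main diseases.

    `history` lists main diseases from oldest to newest visit; `rules` holds
    (antecedent, consequent, confidence) tuples with frozenset itemsets.
    """
    recent = list(history[-max_visits:])
    while recent:
        current = set(recent)
        # A rule matches when its antecedent contains every disease in the set
        matched = [r for r in rules if current <= r[0]]
        if matched:
            matched.sort(key=lambda r: r[2], reverse=True)
            ranked = []
            for _, consequent, _ in matched:
                for disease in sorted(consequent):
                    if disease not in ranked:
                        ranked.append(disease)
            return ranked
        recent.pop(0)  # drop the oldest diagnosis and search again
    return []  # the set became empty, so no prediction is given


# Toy rules, invented for illustration only
rules = [
    (frozenset({"hypertension", "diabetes"}), frozenset({"nephropathy"}), 0.7),
    (frozenset({"diabetes"}), frozenset({"retinopathy"}), 0.9),
    (frozenset({"diabetes"}), frozenset({"nephropathy"}), 0.6),
]
first = predict_complications(["gastritis", "hypertension", "diabetes"], rules)
second = predict_complications(["diabetes"], rules)
```

In the first call no antecedent contains all three diseases, so the oldest one (gastritis) is dropped and the remaining pair matches the first rule. In the second call all three rules match and the consequents are ranked by confidence.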

4 Evaluation

We have conducted an empirical evaluation of the proposed approach by using electronic medical records from a hospital in China and using the methods proposed by Wright et al. [14] and Hoque et al. [20] as the benchmarks.

4.1 Data Preprocessing

The medical dataset we used is from a hospital in China. The dataset includes the information of inpatients and outpatients. Each patient gets a descriptive and longitudinal record to describe what happened during each visit. The record covers the information of diagnoses, lab test results, medications and procedures. The total number of records is about one million. Because we focus on the disease comorbidity relationship mining, we exclude the patients’ data with only one visit to the hospital.

Before mining complication association rules from the dataset, we first clean the data. First, the outpatient information is excluded. In the dataset, some medical information of outpatients is missing or incomplete. Moreover, the treatment outcome of outpatients is not recorded, so the correctness of the diagnosis cannot be evaluated. Second, some doctors may fail to diagnose patients’ diseases correctly, in which case the patients do not get better after inpatient treatment. We therefore remove inpatient records with unclear or uncured treatment results. Third, records that miss important information are marked as invalid and excluded from the experiment.

After the data preprocessing, the final qualified diagnosis data includes 253,271 records, covering 24,754 patients and 6,698 diseases.
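The cleaning steps can be sketched as a record filter. All field names and outcome labels below are hypothetical, since the schema of the hospital dataset is not given, and applying the single-visit exclusion after the other filters is an assumption about the order of operations.

```python
from collections import Counter

REQUIRED = ("patient_id", "visit_type", "diagnosis", "outcome")  # assumed fields
GOOD_OUTCOMES = {"cured", "improved"}  # assumed outcome labels


def clean_records(records):
    """Keep complete inpatient records with a clear, successful outcome."""
    kept = [
        r for r in records
        if all(r.get(f) is not None for f in REQUIRED)  # drop incomplete records
        and r["visit_type"] == "inpatient"              # drop outpatient visits
        and r["outcome"] in GOOD_OUTCOMES               # drop unclear or uncured
    ]
    # Patients with a single remaining visit carry no comorbidity history
    visits = Counter(r["patient_id"] for r in kept)
    return [r for r in kept if visits[r["patient_id"]] > 1]


# Toy records, invented for illustration only
records = [
    {"patient_id": 1, "visit_type": "inpatient", "diagnosis": "diabetes", "outcome": "cured"},
    {"patient_id": 1, "visit_type": "inpatient", "diagnosis": "nephropathy", "outcome": "improved"},
    {"patient_id": 1, "visit_type": "outpatient", "diagnosis": "gastritis", "outcome": "cured"},
    {"patient_id": 2, "visit_type": "inpatient", "diagnosis": "hypertension", "outcome": "cured"},
    {"patient_id": 3, "visit_type": "inpatient", "diagnosis": "asthma", "outcome": "uncured"},
    {"patient_id": 3, "visit_type": "inpatient", "diagnosis": None, "outcome": "cured"},
]
cleaned = clean_records(records)
```

Only the two inpatient records of patient 1 survive in this toy case: patient 2 has a single visit, and both records of patient 3 fail a filter.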

4.2 Metric and Benchmarks

We used precision (P) as the metric to assess the effectiveness of the proposed approach. Specifically, precision is the fraction of complication predictions that are correct. Higher value of P indicates better performance.

We use the diagnoses that carry complication information as the test dataset. The dataset includes 1,410 diagnosis records, each of which contains the main disease and its complications. Based on the mined complication association rules, a list of possible complications can be generated for each patient. If the list includes the actual complications, we count that as a correct prediction. Thus, the metric P is defined as:

$$ P = \frac{\text{Number of correct predictions}}{\text{Number of test records}} \times 100\% $$
(6)
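Eq. (6) can be sketched as follows. The sketch assumes that a prediction counts as correct when every recorded complication appears in the predicted list; a looser "at least one" reading would change only the membership test. The optional `list_length` truncation and the toy data are hypothetical.

```python
def precision(predictions, actual, list_length=None):
    """Eq. (6): percentage of test records with a correct prediction.

    `predictions[i]` is the ranked complication list for test record i and
    `actual[i]` is the set of complications recorded for that record.
    """
    correct = 0
    for predicted, truth in zip(predictions, actual):
        shortlist = predicted if list_length is None else predicted[:list_length]
        if set(truth) <= set(shortlist):
            correct += 1
    return 100.0 * correct / len(actual)


# Toy data, invented for illustration only
predictions = [["retinopathy", "nephropathy"], ["gastritis"], ["stroke", "angina"]]
actual = [{"nephropathy"}, {"anemia"}, {"angina"}]
full = precision(predictions, actual)
top1 = precision(predictions, actual, list_length=1)
```

With the full lists two of the three toy records are predicted correctly; truncating each list to its first entry removes both hits, which shows how the list length affects the metric.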

To evaluate the effectiveness of the proposed approach, the association rule mining (ARM) algorithm introduced by Wright et al. [14] and the rare association rule (RAR) mining approach proposed by Hoque et al. [20] are chosen as benchmarks. Wright et al. applied traditional association rule mining algorithms to medical data and confirmed their validity. Hoque et al. focused on improved low-frequency association rule mining, and the effectiveness of the generated rules has been validated on several real-life datasets.

4.3 Data Analysis and Results

Derive the Concurrent Weight.

In deriving the disease concurrent weights, one important parameter influences the behavior of the BP network: the number of neurons in the hidden layer. We choose the training time of the BP network as the evaluation metric and compare the performance for different numbers of hidden neurons. Based on this comparison of training time, the number of neurons in the hidden layer is set to 7.

After determining the parameters of the BP network, we predict the concurrent weights of all 6,698 diseases. The known weights of the 596 diseases are used to train the model, and the weights of the other diseases are predicted by the trained model. In the end, the weights of 3,340 diseases are calculated. For diseases whose weights could not be predicted, we set the weight to 0.

Weighted Association Rules Mining.

Based on the derived concurrent weights, we implement the weighted association rule mining approach in Java and mine many interesting complication relationships. The number of items in the mined complication association rules varies. Figure 1 shows the distribution of the number of items in the mined rules. Rules with 4 items are the most common, accounting for 21 % of all association rules. Rules that include 3, 4 or 5 items make up more than half of the total. Surprisingly, only a few rules include just two items.

Fig. 1. Number of items in complication rules

We also apply the baseline methods to mine complication association rules from the experiment dataset. The RAR method tries to find rare association rules and generates the largest number of rules, i.e., 96,637. The ARM method derives 42,142 rules. Because some uncommon diseases are weighted, 83,029 rules are mined by the proposed method.

Performance Comparison.

We compare the three algorithms from two perspectives: processing time and accuracy. The three algorithms differ significantly in the time they need for complication association mining. All tests were performed on a PC with a 3.4 GHz Intel i7-4770 CPU and 12 GB of RAM. The running times of RAR, ARM, and the proposed method are 262, 196, and 27 min, respectively. The prediction accuracies of RAR, ARM and the proposed method are 45.3 %, 38.5 % and 80 %, respectively. The results show that the proposed method performs better than the two baseline methods in both processing time and accuracy.

As mentioned before, the prediction step goes through all mined complication rules and takes the consequent items of the matched rules as the predicted complications, so the prediction list includes several diseases. In a practical scenario, however, it is important to limit the length of the list, which gives doctors more focused suggestions. We therefore also compare the accuracy of the three algorithms for different lengths of the prediction list (Table 1). The results show that the proposed method is better than the baseline methods in three scenarios.

Table 1. The accuracy comparison of three algorithms with different list length

5 Conclusions

This paper focuses on mining disease complication association rules from medical information. The concept of the concurrent weight of a disease is proposed and defined, and a BP neural network model is introduced to predict the weight for all related diseases. We then adopt a weighted association rule mining algorithm with an FP-tree structure to retrieve the complication relationships among diseases. Based on the mined rules, a list of a patient’s potential complications can be generated.

This research makes several contributions. First, we propose a new index to evaluate the importance of different diseases in complication prediction. The defined index, i.e., the concurrent weight, describes the possibility that a disease becomes a complication of other diseases. Second, we introduce the BP network to predict the disease weight and design appropriate input and output data. We represent the disease information as a three-dimensional vector, and the output of the BP network is described by two neurons. By using the BP model, we can deduce a relatively complete disease knowledge base. Third, we adopt the weighted association rule approach to mine disease association rules. To the best of our knowledge, this is the first time the weighted association rule mining approach has been applied in the medical field, and some interesting association rules are retrieved.

This study has several limitations, which provide opportunities for future research. First, we focus only on mining relationships among diseases, which cannot describe the complication relationship accurately. Second, due to the scope and complexity of this study, we did not invite medical professionals to evaluate the mined complication association rules. Although the derived rules are helpful for doctors’ decision-making in practice, some rules may not be meaningful, or may even be wrong, from the viewpoint of clinical research. Third, this research uses only prediction accuracy as the metric.