A machine learning-based approach for predicting the outbreak of cardiovascular diseases in patients on dialysis

https://doi.org/10.1016/j.cmpb.2019.05.005

Abstract

Background and Objective: Patients with End-Stage Kidney Disease (ESKD) have a unique cardiovascular risk. This study aims to predict, with a certain precision, death and cardiovascular diseases in dialysis patients.

Methods: We used machine learning techniques on two datasets: an Italian dataset obtained from the Istituto di Fisiologia Clinica of the Consiglio Nazionale delle Ricerche of Reggio Calabria, and an American dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) repository. From each one we derived 5 datasets, one per outcome of interest. We tested different types of algorithm (both linear and non-linear), and finally chose the Support Vector Machine. In particular, we obtained the best performance with the non-linear SVC with RBF kernel, optimized with GridSearch, a procedure that searches for the best combination of hyper-parameters (in our case, the best pair (C, γ)) in order to improve the accuracy of the algorithm.

Results: The non-linear SVC with RBF kernel, optimized with GridSearch, achieved an accuracy of 95.25% in the Italian dataset and of 92.15% in the American dataset in the prediction of Ischaemic Heart Disease over a timeframe of 2.5 years. Performance was worse for the other outcomes.

Conclusions: The machine learning-based approach applied in our study is able to predict, with high accuracy, the outbreak of cardiovascular diseases in patients on dialysis.

Introduction

Patients with end-stage kidney disease (ESKD) have a very high risk of death and cardiovascular disease, a risk that is strongly associated with the level of renal function both in community-based studies and in selected populations with established cardiovascular disease [1].

Moderate renal insufficiency carries a 19% excess risk for cardiovascular complications [2], and the risk is even higher in the elderly [3] and in patients with pre-existing cardiovascular disease [4], [5], [6], [7]. In the general population, the so-called Framingham risk factors are able to predict mortality and cardiovascular events [8], [9], [10]. Similarly, the Body Mass Index (BMI) was found to interact with the Framingham score in predicting incident CV diseases in the general population [11] and in comorbid conditions [12], while educational status [13] and marital status [14] have been widely associated with mortality in the general population. In contrast to the general population, where as much as 75% of the excess risk for coronary heart disease can be explained by Framingham risk factors [15], the excess risk of CVD in CKD patients is not so easy to explain. Other factors, the so-called “uremic risk factors”, might contribute to the increase in cardiovascular risk in patients with ESRD [1], [16].

Machine Learning has been extensively used for predicting clinical outcomes. The literature on this technique is growing rapidly, so covering it fully in a single paper is a hard task. Here we limit our description to some of the most recent studies in various clinical fields and to some surveys that may guide the interested reader, for example Jothi et al. [17], which focuses on healthcare data mining. More recently, a comprehensive survey of the most widely used computational models and algorithms was given by Shishvan et al. [18]. Kavakiotis et al. [19] and Durgadevi and Kalpana [20] present a thorough description of Machine Learning methods applied to diabetes mellitus, whereas Delen et al. [21] successfully predicted the survival time of patients after thoracic transplantation. An application of classifiers for the estimation of heart failure can be found in Tripoliti et al. [22], [23], while Sartakhti et al. [24] applied a support vector machine optimized by simulated annealing to the diagnosis of hepatitis. López-Martínez et al. [25] applied logistic regression to a large dataset to study the risk factors responsible for the emergence of hypertension. The identification of high-risk patients is the main focus of Panicacci et al. [26], who predict the risk of hospitalization from socio-economic and administrative data on elderly citizens.

Recent approaches employ multiple algorithms in order to reduce variance and increase diagnostic accuracy. Wang et al. [27] use an SVM ensemble to predict the outbreak of breast cancer, whereas Zheng et al. [28] present a hybrid of K-means and SVM applied to the diagnosis of breast cancer. In the clinical field of our study, chronic kidney disease has been studied in Abdelaziz et al. [29] and in Park et al. [30], the latter devoted to the analysis of acute kidney injury in cancer patients. Chen et al. [31] applied logistic regression to a large dataset in order to predict the emergence of kidney stone disease. Our study addresses a binary classification problem with supervised learning, since we know both the set of inputs, made up of all the features, and the objective to be achieved, represented by the outcome. To predict mortality due to cardiovascular events and the outbreak of cardiovascular diseases in dialysis patients, several machine learning algorithms (both linear and non-linear) have been used [32]:

  • Logistic Regression (LR);

  • K-Nearest Neighbor (KNN);

  • Classification Decision Tree (CART);

  • Naïve Bayes (NB);

  • LinearSVC (SVCL);

  • Support Vector Classifier with Radial Basis Function kernel (SVCR);

  • SVC with Polynomial kernel (SVCP).

Among these, the algorithms with the greatest predictive power are LR and SVC with RBF kernel. We chose the latter because it is a powerful and widely used algorithm that has already been applied in similar studies.
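A comparison of this kind can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with make_classification, not the Italian or American dataset), the Gaussian variant of Naïve Bayes is assumed, default hyper-parameters are used, and the 5-fold stratified cross-validation is a choice made for the example rather than the protocol reported by the authors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for one outcome-specific
# dataset (assumption: these are NOT the study data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# The seven candidate algorithms listed above, with default settings.
# GaussianNB is an assumed choice for "Naive Bayes".
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "SVCL": LinearSVC(max_iter=10000),
    "SVCR": SVC(kernel="rbf"),
    "SVCP": SVC(kernel="poly"),
}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in candidates.items():
    # Scaling is fitted inside each fold to avoid leaking test information.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

# Print the candidates from highest to lowest mean accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

The ranking printed on synthetic data says nothing about the ranking on clinical data; the sketch only shows how such a comparison can be organised.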

Finally, it is important to underline that the GridSearch optimization algorithm [33] has been used in order to achieve greater accuracy.
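As a minimal sketch of how a grid search over the pair (C, γ) can be set up for an SVC with RBF kernel, the example below uses scikit-learn's GridSearchCV. The data are synthetic, and the candidate values of C and γ, the 5-fold cross-validation, and the 75/25 split are assumptions made for illustration; the grid actually explored by the authors is not given in this text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data used only for illustration (not the study data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature scaling followed by the non-linear SVC with RBF kernel.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Hypothetical grid of (C, gamma) values; every combination is evaluated.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

# Each combination is scored by 5-fold cross-validated accuracy on the
# training portion; the best one is then refitted on all training data.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["svc__C"]
best_gamma = search.best_params_["svc__gamma"]

# Accuracy of the refitted best model on the held-out test portion.
test_accuracy = search.score(X_test, y_test)

print(f"best C = {best_C}, best gamma = {best_gamma}")
print(f"cross-validated accuracy = {search.best_score_:.3f}")
print(f"held-out accuracy = {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted within each cross-validation fold, so the hyper-parameter choice is not influenced by the validation samples.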

Datasets

In this Section the three datasets adopted in this work are described. General characteristics are summarised in Supplemental Table 1. Supplemental Table 2 shows how many samples belong to each class, while Supplemental Table 3 lists the features. For the purpose of the study we considered 5 cardiovascular outcomes (cardiovascular death, heart failure, ischemia, arrhythmia, and other cardiovascular (65 events)) and, consequently, 5 datasets.

Data preprocessing

Before starting, categorical variables were checked and converted into numerical form. For example, for the outcome, class 0 indicated the non-occurrence of the event, while class 1 indicated its occurrence. Moreover, it was necessary to identify and manage missing values (if any). The Italian dataset had no missing values, unlike the American one.

In some of the existing records, the outcome t variable was missing and, in this case, we excluded the patients
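The preprocessing steps described so far (dropping records with a missing outcome, coding the outcome as 0/1, converting categorical variables to numbers, and locating missing values) might look like the following pandas sketch. The column names, category labels, and values are invented for the example and do not come from either dataset; how missing feature values were eventually handled is not covered here, so the sketch only counts them.

```python
import numpy as np
import pandas as pd

# Toy records; column names and values are hypothetical.
records = pd.DataFrame({
    "age": [63, 71, 58, 66],
    "sex": ["F", "M", "M", "F"],
    "albumin": [3.9, np.nan, 3.4, 4.1],
    "outcome": ["event", "no event", None, "no event"],
})

# Exclude patients whose outcome is missing: they cannot be used
# for supervised learning.
labelled = records.dropna(subset=["outcome"]).copy()

# Encode the outcome: 0 = non-occurrence, 1 = occurrence of the event.
labelled["outcome"] = (
    labelled["outcome"].map({"no event": 0, "event": 1}).astype(int)
)

# Convert a categorical feature into numerical form (assumed coding).
labelled["sex"] = labelled["sex"].map({"F": 0, "M": 1})

# Count the remaining missing values in each feature column.
missing_per_feature = labelled.drop(columns="outcome").isna().sum()

print(labelled)
print(missing_per_feature)
```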

Model selection and evaluation

To select the model, we first needed to understand whether we were dealing with a linear or a non-linear problem.

Usually, in order to measure the degree of correlation (i.e. of linear dependence) between two variables, we proceed with visual analysis using scatter plots.

Due to the high number of variables, this approach was not adopted.
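For reference, the degree of linear dependence between two variables is commonly quantified by the Pearson correlation coefficient. The sketch below computes it with NumPy on randomly generated values; the variable names are borrowed only for illustration, the numbers are made up, and nothing here reproduces an analysis from the study.

```python
import numpy as np

# Randomly generated values; the names are illustrative labels only.
rng = np.random.default_rng(0)
glucose = rng.normal(100.0, 15.0, size=200)
calcium = rng.normal(9.0, 0.5, size=200)


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


r = pearson_r(glucose, calcium)

# Cross-check the hand-written formula against NumPy's built-in routine.
assert np.isclose(r, np.corrcoef(glucose, calcium)[0, 1])

print(f"Pearson r = {r:.3f}")
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear dependence (a non-linear relationship may still exist).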

In Supplemental Fig. 1 the two- and three-dimensional scatter plots of calcium as a function of glucose, in order to ascertain if the two data clouds are

Discussion

The final results show that the model used (SVC with RBF kernel and GridSearch algorithm) achieves important results in predicting mortality and the onset of cardiovascular diseases in dialysis patients. Similarly to the Framingham risk score, developed and validated in the general population, and to the other risk scores validated in the field of nephrology [38], [39], our model, once validated in a wider context, will make it possible to predict the individual risk of mortality and/or CV

Acknowledgments

The HEMO study was conducted by the HEMO Investigators and supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The data from the HEMO study reported here were supplied by the NIDDK Central Repositories. This manuscript was not prepared in collaboration with Investigators of the HEMO study and does not necessarily reflect the opinions or views of the HEMO study, the NIDDK Central Repositories, or the NIDDK.

References (41)

  • H. Wang et al. A support vector machine-based ensemble algorithm for breast cancer diagnosis. Eur. J. Oper. Res. (2018)

  • B. Zheng et al. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst. Appl. (2014)

  • A. Abdelaziz et al. A machine learning model for improving healthcare services on cloud computing environment. Measurement (2018)

  • Z. Chen et al. Development of a personalized diagnostic model for kidney stone disease tailored to acute care by integrating large clinical, demographics and laboratory data: the diagnostic acute care algorithm - kidney stones (DACA-KS). BMC Med. Inform. Decis. Making (2018)

  • J. Floege et al. Development and validation of a predictive mortality risk score from a European hemodialysis cohort. Kidney Int. (2015)

  • S.D. Anker et al. Development and validation of cardiovascular risk scores for haemodialysis patients. Int. J. Cardiol. (2016)

  • T. Greene et al. Design and statistical issues of the hemodialysis (HEMO) study. Control. Clin. Trials (2000)

  • D.E. Weiner et al. Chronic kidney disease as a risk factor for cardiovascular disease and all-cause mortality: a pooled analysis of community-based studies. J. Am. Soc. Nephrol. (2004)

  • P.W. De Leeuw et al. Prognostic significance of renal function in elderly patients with isolated systolic hypertension: results from the Syst-Eur trial. J. Am. Soc. Nephrol. (2002)

  • G. Schillaci et al. High-normal serum creatinine concentration is a predictor of cardiovascular risk in essential hypertension. Arch. Intern. Med. (2001)