Predicting the onset of type 2 diabetes using wide and deep learning with electronic health records

https://doi.org/10.1016/j.cmpb.2019.105055

Highlights

  • Information from electronic health records can be used to help us understand what contributes to the onset of diseases including type 2 diabetes mellitus.

  • Machine learning, including deep learning, has been used to predict the onset of diseases using information from electronic health records.

  • Our work is the first to use wide and deep learning, a state-of-the-art deep learning architecture that combines memorisation and generalisation abilities, to predict the onset of type 2 diabetes mellitus using electronic health records.

  • Our algorithm is better at predicting the onset of type 2 diabetes mellitus than other state-of-the-art machine learning algorithms using the same dataset with similar experimental settings.

  • The synthetic minority over-sampling technique (SMOTE) improved sensitivity on this imbalanced electronic health record dataset more when combined with the wide and deep learning framework than with the other machine learning algorithms.

Abstract

Objective

Diabetes is responsible for considerable morbidity, healthcare utilisation and mortality in both developed and developing countries. Current methods of treating diabetes are inadequate and costly, so prevention is an important step in reducing the burden of diabetes and its complications. Electronic health records (EHRs) for individuals and populations have become important tools for understanding the developing trends of diseases. Using EHRs to predict the onset of diabetes could improve the quality and efficiency of medical care. In this paper, we apply a wide and deep learning model, which combines the strengths of a generalised linear model with a variety of features and a deep feed-forward neural network, to improve the prediction of the onset of type 2 diabetes mellitus (T2DM).

Materials and methods

The proposed method was implemented by training the wide and deep components jointly against a logistic loss function using stochastic gradient descent. We applied this model to public hospital record data provided by Practice Fusion EHRs for a United States population. The dataset consists of de-identified electronic health records for 9948 patients, of whom 1904 have been diagnosed with T2DM. Prediction of diabetes in 2012 was based on data obtained from previous years (2009–2011). Class imbalance was handled with the Synthetic Minority Oversampling Technique (SMOTE), applied to each cross-validation training fold, to analyse performance when synthetic examples for the minority class are created. We used SMOTE of 150 and 300 percent, where 300 percent means that three new synthetic instances are created for each minority-class instance. This results in approximate diabetes:non-diabetes distributions in the training set of 1:2 and 1:1, respectively.
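For illustration, the fold-wise oversampling described above can be reproduced with a short sketch. This is not the authors' code; it assumes a numeric feature matrix X and binary labels y (1 = T2DM), uses scikit-learn and imbalanced-learn, and substitutes a plain logistic regression for the wide and deep model.

```python
# Minimal sketch: apply SMOTE inside each cross-validation training fold only,
# so that synthetic minority examples never leak into the held-out fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

def cross_validate_with_smote(X, y, smote_percent=300, n_splits=10, seed=42):
    # smote_percent=300 creates three synthetic instances per minority instance.
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        n_min = int(np.sum(y_tr == 1))
        target = n_min + n_min * smote_percent // 100   # minority count after SMOTE
        sm = SMOTE(sampling_strategy={1: target}, random_state=seed)
        X_res, y_res = sm.fit_resample(X_tr, y_tr)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)  # placeholder classifier
        proba = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], proba))
    return float(np.mean(aucs))
```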

Results

Our final ensemble model not using SMOTE obtained an accuracy of 84.28%, area under the receiver operating characteristic curve (AUC) of 84.13%, sensitivity of 31.17% and specificity of 96.85%. Using SMOTE of 150 and 300 percent did not improve AUC (83.33% and 82.12%, respectively) but increased sensitivity (49.40% and 71.57%, respectively) with a moderate decrease in specificity (90.16% and 76.59%, respectively).
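For clarity, the reported metrics can be computed from a model's predicted probabilities as follows. This is a generic sketch rather than the authors' evaluation code; it assumes arrays y_true and y_prob and a 0.5 decision threshold.

```python
# Generic sketch: compute accuracy, AUC, sensitivity and specificity
# from true labels and predicted probabilities (0.5 decision threshold assumed).
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

def report_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),   # recall on the diabetic class
        "specificity": tn / (tn + fp),   # recall on the non-diabetic class
    }
```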

Discussion and conclusions

Our algorithm has further optimised the prediction of diabetes onset using a novel state-of-the-art machine learning algorithm: the wide and deep learning neural network architecture.

Introduction

Diabetes is responsible for considerable morbidity, healthcare utilisation and mortality in both developed and developing countries. Globally, in 2017 an estimated 425 million people had diabetes, and this is predicted to increase to 629 million by the end of 2045 [1]. Type 2 diabetes mellitus (T2DM) is the most common type of diabetes (95%) in the United States (US) [2]. In the US, more than 30 million people had diabetes in 2017 [1]. The high costs of hospital treatment and the high rate of readmission associated with diabetes mean that early prevention and effective treatment are crucial [3]. The early prediction of the onset of diabetes using routinely available data such as electronic health records (EHRs) is therefore important [4].

EHRs are relatively complete electronic systems that can store information from millions of patients across many healthcare institutions, including patient demographics, medical data (e.g., diagnoses, laboratory tests and medications), clinical notes and so on [5], [6]. In the past, EHRs were used by doctors, healthcare practitioners and public health workers to store and retrieve patients’ information for clinical care [7]. The secondary use of EHR data for tool development aims to assist healthcare practitioners and policy makers to initiate or modify interventions, understand disease progression and introduce or improve policies to help prevent disease [8]. Patient information in EHRs is high-dimensional and heterogeneous, with class-imbalanced data (i.e., far more non-diabetic than diabetic patients) [4] and missing data [6], making it difficult to develop efficient analytic models using classical statistical methods [9]. The availability of EHRs, along with advances in hardware (Central Processing Units (CPUs) and Graphics Processing Units (GPUs)) and computer algorithms (machine learning and especially its sub-field, deep learning), makes it possible to predict disease onset with high accuracy. With respect to diabetes, most studies utilising EHRs have used and compared the performance of common machine learning algorithms (k-Nearest Neighbors, Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, and Logistic Regression) in the prediction of diabetes progression [10], [11], [12], [13], [14], [15], [16], [17].

Deep learning algorithms have been used in recent years to predict the onset of diseases from secondary uses of EHRs. In healthcare research, deep learning models can outperform classical machine learning methods, which require more manual feature engineering [6]. Moreover, the longitudinal event data and continuous monitoring characteristics of EHRs allow complex and challenging deep learning models to be trained [6]. Compared with statistical models for predicting the onset of diabetes using risk factors (logistic regression [18]) and patient mortality using hazard ratios (survival analysis [19]), and with classical machine learning (decision tree, random forest and support vector machine [20]), deep learning can automatically learn feature representations from the input data and consequently reduces the need for feature engineering [21]. To attain state-of-the-art performance with less computational resource, a wide and deep learning framework was developed by Google to achieve both memorisation and generalisation [22]. Memorisation is achieved by learning a wide set of cross-product feature transformations that capture the correlation between the co-occurrence of a feature pair and the target label. Generalisation is obtained by matching different features that are close to each other in an embedding space generated by a deep feed-forward neural network. In this framework, the wide part is a generalised linear model and the deep part is a feed-forward neural network. By combining the advantages of both components, the framework can handle data structures that are highly varied and complicated. To the best of our knowledge, there has been little previous work using deep learning approaches to develop risk scores from large healthcare datasets [23], [24], [25], [26], [27].
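As a rough illustration of this architecture (not the authors' exact model), the sketch below combines a linear "wide" path over sparse one-hot and crossed features with a "deep" path that embeds a categorical code and passes it, together with continuous features, through feed-forward layers; the input sizes n_wide_features, n_vocab and n_numeric are placeholders.

```python
# Minimal sketch of a wide & deep network in Keras:
# the wide part is a linear model over sparse crossed/one-hot features,
# the deep part embeds a categorical feature and passes it through dense layers,
# and both feed a single sigmoid output trained with logistic loss and SGD.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_wide_features = 5000   # assumed width of the one-hot + crossed feature vector
n_vocab = 2000           # assumed vocabulary size of a categorical feature (e.g. a diagnosis code)
n_numeric = 20           # assumed number of continuous features (labs, vitals, age, ...)

wide_in = layers.Input(shape=(n_wide_features,), name="wide")        # sparse crosses, one-hot
code_in = layers.Input(shape=(1,), dtype="int32", name="code")       # categorical index
num_in = layers.Input(shape=(n_numeric,), name="numeric")

# Deep component: embedding + feed-forward layers (generalisation)
emb = layers.Flatten()(layers.Embedding(n_vocab, 32)(code_in))
deep = layers.Concatenate()([emb, num_in])
deep = layers.Dense(128, activation="relu")(deep)
deep = layers.Dense(64, activation="relu")(deep)

# Wide component joined with the deep output: logistic regression on both (memorisation + generalisation)
out = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([wide_in, deep]))

model = Model(inputs=[wide_in, code_in, num_in], outputs=out)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```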

Miotto et al. [8] developed a novel unsupervised deep learning algorithm (Deep Patient) to predict the future of patients using 700,000 records from the Mount Sinai EHRs. They used demographic information (age, sex and race), diagnoses (ICD-9 codes), medications, procedures, laboratory tests and clinical notes. They designed a multi-layer deep representation neural network optimised with stochastic gradient descent against a local unsupervised criterion. Their model was tested on 76,214 patients across 78 diseases. The AUC for predicting T2DM with complications within one year was 90.7%. The algorithm was found to improve the prediction of various diseases from EHRs and to support other tasks such as clinical trial recruitment and treatment suggestions.
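An unsupervised representation of this kind is commonly realised with denoising autoencoders; a minimal single-layer sketch is shown below. It is purely illustrative: the dimensions and corruption rate are assumptions, not the published Deep Patient configuration.

```python
# Minimal sketch of an unsupervised patient representation via a denoising
# autoencoder (one common realisation of a "local unsupervised criterion").
import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 40000   # assumed size of the raw EHR descriptor per patient

x_in = layers.Input(shape=(n_features,))
noisy = layers.Dropout(0.05)(x_in)                       # corrupt the input
code = layers.Dense(500, activation="sigmoid")(noisy)    # learned patient representation
recon = layers.Dense(n_features, activation="sigmoid")(code)

autoencoder = Model(x_in, recon)
encoder = Model(x_in, code)
autoencoder.compile(optimizer="sgd", loss="binary_crossentropy")
# After fitting on raw EHR vectors, encoder.predict(...) yields dense patient
# representations that a downstream classifier can use to predict disease onset.
```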

In recent work, Pham et al. [27] introduced a deep dynamic neural network framework (DeepCare) that performs various tasks, including assessing patient trajectories and predicting future disease outcomes. The dataset contained more than 12,000 patients recorded between 2002 and 2013, of whom 7191 were selected. It was divided into three parts: 67% for parameter estimation, 16.5% for tuning and 16.5% for testing. On the diabetes dataset, DeepCare with max-pooling achieved an F-score of nearly 60%.
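A rough sketch of this style of model, encoding a patient's sequence of visits with a recurrent layer and max-pooling over time before a binary outcome prediction, is given below. It is not DeepCare itself, and the vocabulary and sequence sizes are placeholders.

```python
# Rough sketch: recurrent encoding of a visit sequence with max-pooling over time.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, max_visits = 5000, 50   # placeholder sizes for codes per visit / visits per patient

seq_in = layers.Input(shape=(max_visits,), dtype="int32")
emb = layers.Embedding(vocab_size, 64)(seq_in)
states = layers.LSTM(64, return_sequences=True)(emb)
pooled = layers.GlobalMaxPooling1D()(states)          # max-pooling over the trajectory
out = layers.Dense(1, activation="sigmoid")(pooled)   # probability of the future outcome

model = Model(seq_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```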

One of the most important secondary uses of EHRs is the development of web-based tools or software for predicting future outcomes. One example of such online tools is QDiabetes™-2018 [28], an algorithm developed by ClinRisk Ltd using Cox proportional hazards models and information from the QResearch database in the UK (https://www.qresearch.org). QDiabetes is a risk prediction algorithm that calculates an individual’s risk of developing T2DM over the next 10 years for people aged 25 to 84 years, taking account of their individual risk factors (age, sex, ethnicity, clinical values and diagnoses) [29]. The tool is integrated into doctors’ computer systems and achieves an average area under the receiver operating characteristic curve of 0.85 for women and 0.83 for men.
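QDiabetes itself is proprietary, but the underlying idea, a Cox proportional hazards model read off at a 10-year horizon, can be sketched generically with the lifelines library. The file and column names below are hypothetical, not the QDiabetes specification.

```python
# Generic sketch of a Cox proportional hazards risk model (QDiabetes-style).
# 'cohort.csv', 'followup_years' and 'developed_t2dm' are hypothetical names;
# covariates must be numeric (categorical ones encoded beforehand).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("cohort.csv")   # e.g. columns: age, bmi, sex_male, followup_years, developed_t2dm

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="developed_t2dm")

# 10-year risk for new individuals = 1 - S(10 | covariates)
new_patients = df.drop(columns=["followup_years", "developed_t2dm"]).head()
surv_at_10 = cph.predict_survival_function(new_patients, times=[10.0])
risk_10yr = 1.0 - surv_at_10.iloc[0]   # Series of individual 10-year risks
```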

In summary, compared with classical machine learning models, deep learning can extract useful information from EHRs by learning features related to diabetes outcomes, and can therefore help target people who are likely to develop the disease so that they can change their lifestyles. This information is important for developing tools and software based on secondary uses of EHRs. In this study, we applied a wide and deep learning approach to predict the onset of type 2 diabetes mellitus using the Practice Fusion EHR dataset and compared its performance with a machine learning approach used by Pimentel et al. [17]. The wide and deep learning approach has been increasingly used for clinical risk prediction and classification. It is anticipated that predictive modelling using data from EHRs will drive personalised medicine and improve healthcare quality. Such models could assist health practitioners (doctors/clinicians) with the prognosis of diabetes and policy makers with designing suitable interventions to reduce the burden of diabetes.

Section snippets

Data source

We used a publicly available EHR dataset from the United States released by Practice Fusion in 2012 for a data science competition, and compared our model's performance with that of Pimentel et al. [17], who applied a random forest with temporal features and feature selection to T2DM onset prediction on this dataset. The dataset consisted of de-identified electronic health records of 9948 patients, with 1904 diagnosed with T2DM over a four-year period (2009–2012). The dataset also

Results and discussion

Table 3 shows the performance obtained on the test set using the 10 models from a 10-fold stratified cross-validation and the final ensemble model. The ensemble model produced an AUC of 84.13% (Fig. 6), which is higher than that of each individual model. This means the model-averaging ensemble is more robust and produces better performance on average than a single model.
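The model-averaging step can be written in a few lines. The sketch below assumes a list of fitted fold models, each exposing a predict method that returns the probability of T2DM for each patient.

```python
# Minimal sketch of the model-averaging ensemble: the final prediction is the
# mean of the probabilities produced by the models trained in cross-validation.
import numpy as np

def ensemble_predict(fold_models, X_test):
    # fold_models: fitted models from the 10 cross-validation folds
    probs = np.column_stack([np.ravel(m.predict(X_test)) for m in fold_models])
    return probs.mean(axis=1)   # averaged probability of T2DM per patient
```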

We further tested the algorithm using 10-fold cross-validation and the final ensemble model was selected for assessing the

Conclusions

In this study, we proposed a wide and deep learning neural network architecture for predicting the onset of diabetes using a publicly available EHR dataset. Our ensemble model improved AUC and specificity and substantially improved sensitivity for predicting T2DM onset compared with other machine learning algorithms using the same dataset and experimental settings [17]. In the future, we will incorporate an automatic feature selection method to design the crossed features and

Declaration of Competing Interest

All authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

B. P. Nguyen and Q. H. Nguyen gratefully acknowledge the support of NVIDIA Corporation for the donation of the GPUs used for this study.

References (34)

  • J.A. Casey et al., Using electronic health records for population health research: a review of methods and applications, Ann. Rev. Public Health (2016)

  • C. Xiao et al., Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review, J. Am. Med. Inform. Assoc. (2018)

  • W.R. Hersh, Adding value to the electronic health record through secondary use of data for quality assurance, research, and surveillance, Am. J. Manag. Care (2007)

  • R. Miotto et al., Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records, Sci. Rep. (2016)

  • B.A. Goldstein et al., Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review, J. Am. Med. Inform. Assoc. (2017)

  • S. Mani et al., Type 2 diabetes risk forecasting from EMR data using machine learning, Proceedings of the AMIA Annual Symposium (2012)

  • N. Razavian et al., Population-level prediction of type 2 diabetes from claims data and analysis of risk factors, Big Data (2015)