Hybrid prediction model for Type-2 diabetic patients

doi:10.1016/j.eswa.2010.05.078

Expert Systems with Applications

Volume 37, Issue 12, December 2010, Pages 8102-8108

Abstract

A wide range of computational methods and tools for data analysis is available. In this study we took advantage of these technological advancements to develop a prediction model for Type-2 diabetes. We aim to investigate how the incidence of diabetes is affected by patients’ characteristics and measurements. Efficient predictive modeling is required by medical researchers and practitioners. This study proposes a Hybrid Prediction Model (HPM), which first uses the Simple K-means clustering algorithm to validate the chosen class label of the given data (incorrectly classified instances are removed, i.e. a pattern is extracted from the original data) and then applies a classification algorithm to the resulting set. The C4.5 algorithm is used to build the final classifier model, using the k-fold cross-validation method. The Pima Indians diabetes data were obtained from the University of California at Irvine (UCI) machine learning repository. A wide range of classification methods has previously been applied to this dataset by various researchers in order to find the best-performing algorithm, with reported accuracies in the range of 59.4–84.05%. The proposed HPM, however, obtained a classification accuracy of 92.38%. To evaluate the performance of the proposed method, we used sensitivity and specificity, two performance measures commonly used in medical classification studies.

Introduction

Diabetes is among the most common diseases today, affecting all populations and all age groups. It is a disease in which the body does not produce or properly use insulin. The cells of the body require glucose for growth, for which insulin is essential. When someone has diabetes, little or no insulin is secreted. In this situation, plenty of glucose is available in the bloodstream but the body is unable to use it (Mohamed et al., 2002). There are two main types of diabetes, Type-1 and Type-2. Type-1 diabetes occurs when the body’s immune system attacks and destroys the beta cells of the pancreas (the cells that produce insulin). This results in insulin deficiency. The only treatment for Type-1 diabetes is insulin. Type-2 diabetes, on the other hand, is caused by relative insulin deficiency. The pancreas in Type-2 diabetes still produces insulin, but the insulin may not be effective, or the amount produced may not be sufficient to control blood glucose (Guthrie & Guthrie, 2002). Type-2 diabetes is the most common type of diabetes (Acharya, Tan, Subramanian, et al., 2008) and usually develops at age 40 or older.

According to the Diabetes Atlas, an estimated 194 million people worldwide, or 5.1% of the adult population, have diabetes, and this number will increase to 333 million, or 6.3%, by 2025 (Gan, 2003). Type-2 diabetes constitutes about 85–95% of all diabetes in developed countries and accounts for an even higher percentage in developing countries. Type-2 diabetes is a serious global health problem which, for most countries, has evolved in association with rapid cultural and social changes, ageing populations, increasing urbanization, dietary changes, reduced physical activity and other unhealthy lifestyle and behavioral patterns (Pickup & Williams, 2003). The purpose of this study is to build a Hybrid Prediction Model that can accurately classify newly diagnosed patients (pregnant women) into either a group that is likely to develop diabetes or a group that is not likely to develop diabetes within 5 years of the first diagnosis.

In this study the benchmark data come from the UCI machine learning database, available at ftp://ftp.ics.uci.edu/pub/machine-learningdatabases (accessed 14.03.2009). This dataset was chosen solely because it is very commonly used to evaluate other classification systems, which makes it easier to compare the results of the proposed model with earlier work on the Pima Indian diabetes diagnosis problem.

Section snippets

Background

In this section we mainly discuss data mining, the data mining tool and the data mining methods used (clustering and classification).

Predictive classifications in medicine literature review

In recent years, the use of predictive classification in medical diagnosis has received a strong boost from intensive research activity in this field. Several researchers have highlighted the potential of predictive data mining to infer clinically relevant models from patient data and to provide decision support. The majority of papers published in the area of predictive classification for diabetic data deal with the goal of improving accuracy.

Feature selections

Feature selection is the process of determining and selecting the features that are most relevant to the data mining task. This technique is also known as attribute selection or relevance analysis. The quality of the data is an important aspect of any data mining application. Various measures can be used to assess data quality; however, accuracy and consistency are the two most important ones. The definition of each feature in the database is analyzed. If a feature is not

Predictive data mining process for proposed model

Data mining is most often the application of a number of different techniques from various disciplines with the goal of discovering interesting patterns in data. Here we have followed the guidelines of the predictive data mining process (Riccardo & Blaz, 2008).
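The cluster-based filtering step summarized in the abstract (cluster the data, then remove instances whose cluster disagrees with their class label) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the farthest-first initialisation, the majority-vote mapping from clusters to class labels and the toy data are all hypothetical. The subsequent step, training a C4.5-style decision tree on the retained instances under k-fold cross-validation, is not shown.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Deterministic farthest-first initialisation (an assumption of this
    # sketch, chosen so that the example is reproducible).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assign == assign:  # converged, nothing moved
            break
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def filter_by_cluster(points, labels, k=2):
    """Return indices of instances whose label matches their cluster's majority label."""
    assign = kmeans(points, k)
    # Assumption: each cluster is mapped to the most frequent class among its members.
    majority = {
        c: Counter(l for l, a in zip(labels, assign) if a == c).most_common(1)[0][0]
        for c in set(assign)
    }
    return [i for i, (l, a) in enumerate(zip(labels, assign)) if majority[a] == l]


# Toy data (made up): two well-separated groups; instance 3 carries a label
# that disagrees with the rest of its group.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

kept = filter_by_cluster(points, labels)
print(kept)  # → [0, 1, 2, 4, 5, 6, 7]
```

In this toy run the instance at index 3 is dropped because its label differs from the majority label of its cluster; only the retained instances would be passed on to the classifier.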

Analysis of results

The true positive (TP), true negative (TN), false positive (FP) and false negative (FN) rates are important for interpreting the result of a classifier; the values of these parameters are presented in Table 5. They can be used to compute accuracy, sensitivity and specificity. Sensitivity is also referred to as the true positive rate, that is, the proportion of positive tuples that are correctly identified, while specificity is the true negative rate, that is, the proportion of negative tuples that are
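As a brief illustration of how these measures follow from the four confusion-matrix counts, the sketch below uses the standard definitions of accuracy, sensitivity and specificity. The function name and the counts in the usage example are hypothetical and are not taken from Table 5.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # Fraction of all tuples that were classified correctly.
        "accuracy": (tp + tn) / total,
        # True positive rate: correctly identified positives over all positives.
        "sensitivity": tp / (tp + fn),
        # True negative rate: correctly identified negatives over all negatives.
        "specificity": tn / (tn + fp),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, 90 of 100 tuples are classified correctly, 40 of the 45 positive tuples are recovered and 50 of the 55 negative tuples are recovered.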

References (29)

G.A. Carpenter et al. ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases. Neural Networks (1998)
D. Delen et al. Predicting breast cancer survivability: A comparison of three data mining methods. Artificial Intelligence in Medicine (2005)
K. Polat et al. A cascade learning system for classification of diabetes disease: Generalized discriminant analysis and least square support vector machine. Expert Systems with Applications (2008)
M. Sebban et al. A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recognition (2002)
U.R. Acharya et al. Automated identification of diabetic type 2 subjects with and without neuropathy using wavelet transform on pedobarograph. Journal of Medical Systems (2008)
Bioch, J. C., Meer, O., & Potharst, R. (1996). Classification using Bayesian neural nets. In International conference...
N.V. Chawla et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (JAIR) (2002)
Deng, D., & Kasabov, N. (2001). On-line pattern analysis by evolving self-organizing maps. In Proceedings of the fifth...
Gan, D. (2003). Diabetes atlas, Brussels: International diabetes federation (2nd ed.)....
Guojun, G., Chaoqu, M., & Jianhong, W. (2007). Data clustering theory algorithm and application (1st ed.)....
R.A. Guthrie et al. Nursing management of diabetes mellitus (2002)
M.A. Hall. Correlation-based subset feature selection for machine learning (1999)
J. Han et al. Data mining: Concepts and techniques (2006)
K. Hoshi et al. An analysis of thyroid function diagnosis using Bayesian-type and SOM type neural networks. Chemical and Pharmaceutical Bulletin (2005)

