Hybrid prediction model for Type-2 diabetic patients

https://doi.org/10.1016/j.eswa.2010.05.078

Abstract

A wide range of computational methods and tools for data analysis is available. In this study we took advantage of these technological advancements to develop models for predicting Type-2 diabetes in patients. We aim to investigate how the incidence of diabetes is affected by patients’ characteristics and measurements. Medical researchers and practitioners require efficient predictive modeling. This study proposes a Hybrid Prediction Model (HPM) that first uses the Simple K-means clustering algorithm to validate the chosen class label of the given data (incorrectly classified instances are removed, i.e. a pattern is extracted from the original data) and then applies a classification algorithm to the resulting set. The C4.5 algorithm is used to build the final classifier model, using the k-fold cross-validation method. The Pima Indians diabetes data was obtained from the University of California at Irvine (UCI) machine learning repository. A wide range of classification methods has previously been applied by various researchers in order to find the best-performing algorithm on this dataset, with accuracies in the range of 59.4–84.05%. The proposed HPM, however, obtained a classification accuracy of 92.38%. To evaluate the performance of the proposed method, we used sensitivity and specificity, performance measures that are commonly used in medical classification studies.
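The label-validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Lloyd's k-means written from scratch, a majority vote to map each cluster to a class, and a made-up two-feature toy dataset. The names kmeans, filter_by_cluster, points and labels are hypothetical. The filtered instances would then be passed to a classifier such as C4.5, which is not shown here.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k points drawn at random from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [None] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments are stable, so the algorithm has converged
        assign = new_assign
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def filter_by_cluster(points, labels, k=2, seed=0):
    """Return indices of instances whose label matches their cluster's majority label."""
    assign = kmeans(points, k, seed=seed)
    majority = {}
    for c in set(assign):
        votes = [lab for lab, a in zip(labels, assign) if a == c]
        majority[c] = max(set(votes), key=votes.count)
    return [i for i, (lab, a) in enumerate(zip(labels, assign))
            if majority[a] == lab]


# Hypothetical toy data (not the Pima Indians dataset): two well-separated
# groups, with instances 3 and 7 carrying the "wrong" class label.
points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.0]]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

keep = filter_by_cluster(points, labels, k=2)
clean_points = [points[i] for i in keep]
clean_labels = [labels[i] for i in keep]
```

In this sketch the instances at indices 3 and 7 are dropped because their labels disagree with the majority label of the cluster they fall into; only the remaining instances would be used to train and cross-validate the final classifier.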

Introduction

Diabetes is one of the most common diseases today, affecting all populations and all age groups. It is a disease in which the body does not produce or properly use insulin. The cells of the body require glucose for growth, for which insulin is essential. When someone has diabetes, little or no insulin is secreted. In this situation, plenty of glucose is available in the bloodstream but the body is unable to use it (Mohamed et al., 2002). There are two main types of diabetes, Type-1 and Type-2. Type-1 diabetes occurs when the body’s immune system attacks and destroys the beta cells of the pancreas, which produce insulin. This results in insulin deficiency. The only treatment for Type-1 diabetes is insulin. Type-2 diabetes, on the other hand, is caused by relative insulin deficiency. In Type-2 diabetes the pancreas still produces insulin, but the insulin may not be effective or the amount produced may not be sufficient to control blood glucose (Guthrie & Guthrie, 2002). Type-2 diabetes is the most common type of diabetes (Acharya, Tan, Subramanian, et al., 2008) and usually develops at age 40 or older.

According to the Diabetes Atlas, an estimated 194 million people worldwide, or 5.1% of the adult population, have diabetes, and this figure is expected to increase to 333 million, or 6.3%, by 2025 (Gan, 2003). Type-2 diabetes constitutes about 85–95% of all diabetes in developed countries and accounts for an even higher percentage in developing countries. Type-2 diabetes is a serious global health problem which, for most countries, has evolved in association with rapid cultural and social changes, ageing populations, increasing urbanization, dietary changes, reduced physical activity and other unhealthy lifestyle and behavioral patterns (Pickup & Williams, 2003). The purpose of this study is to build a Hybrid Prediction Model that can accurately classify newly diagnosed patients (pregnant women) into either a group that is likely to develop diabetes or a group that does not develop diabetes within 5 years of the first diagnosis.

In this study the benchmark data come from the UCI machine learning database, available at ftp://ftp.ics.uci.edu/pub/machine-learningdatabases (accessed 14.03.2009). This dataset was chosen only because it is very commonly used by other classification systems, which makes it easier to compare the results of the proposed model on the Pima Indian diabetes diagnosis problem.

Background

In this section we mainly discuss data mining, data mining tools and data mining methods (clustering and classification).

Predictive classifications in medicine literature review

In recent years, the use of predictive classification in medical diagnosis has received a strong boost owing to intensive research activity in this field. Over the last few years, several researchers have highlighted the potential of predictive data mining to infer clinically relevant models from patient data and to provide decision support. The majority of papers published in the area of predictive classification for diabetic data aim to improve accuracy.

Feature selections

Feature selection is the process of determining and selecting the features that are most relevant to the data mining task. This technique is also known as attribute selection or relevance analysis. The quality of data is an important aspect of a data mining application. Various measures can be used to assess data quality; however, accuracy and consistency are the two most important. The definition of each feature in the database is analyzed. If a feature is not

Predictive data mining process for proposed model

Data mining most often involves applying a number of different techniques from various disciplines with the goal of discovering interesting patterns in data. Here we follow the guidelines of the predictive data mining process (Riccardo & Blaz, 2008).

Analysis of results

The present study shows that the TP, TN, FP and FN rate parameters are important for interpreting the result of a classifier; their values are presented in Table 5. These parameters can be used to measure accuracy, sensitivity and specificity. Sensitivity, also referred to as the true positive rate, is the proportion of positive tuples that are correctly identified, while specificity, the true negative rate, is the proportion of negative tuples that are
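The measures named above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name classification_metrics and the counts passed to it are hypothetical and are not the values from Table 5.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # Share of all tuples that were classified correctly.
        "accuracy": (tp + tn) / total,
        # True positive rate: correctly identified positives over all positives.
        "sensitivity": tp / (tp + fn),
        # True negative rate: correctly identified negatives over all negatives.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because a classifier can reach high accuracy while missing most of the positive cases.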

References (29)

  • R.A. Guthrie et al. Nursing management of diabetes mellitus (2002)
  • M.A. Hall. Correlation-based subset feature selection for machine learning (1999)
  • J. Han et al. Data mining: Concepts and techniques (2006)
  • K. Hoshi et al. An analysis of thyroid function diagnosis using Bayesian-type and SOM type neural networks. Chemical and Pharmaceutical Bulletin (2005)