Keywords

1 Introduction

Chronic disease has become an important public health issue facing the world now, especially the cardiovascular disease (CVD) which ranks the first cause of death in the world. As the predisposing factors of CVD, Hypertension, Hyperglycemia and Hyperlipidemia (3Hs) has attracted much attention and research. According to statistics, nearly 3 million people die of varying degrees of CVD in China every year, while the number of people suffering from CAD is far more than 300 million. The death toll caused by 3Hs accounts for more than a quarter in China. Literature shows that high blood fatness in adults can lead to long-term risk of coronary heart disease [1].

The link among the 3Hs is well-connected. The increased blood fatness leads to increased blood pressure. The patients with hyperglycemia are always accompanied with the increased blood lipid. Obese hypertensive patients are generally suffering from diabetes.

Obesity relates to 3Hs closely especially hypertension [2]. In fact, the prevalence of obesity is increasing substantially in the world. It is estimated that 35.5% of adult female and 32.2% of adult male in the US are obese [3]. The Brazilian Institute of Geography and Statistics has shown that 50.1% of male and 48% of female in Brazil are overweight, while 12.4% of male and 16.9% of female are suffering from obesity [4].

The report shows that the hyperlipidemia and hyperglycemia is often overlooked, thanks to the inconvenient examination method, and more than half of patients realize their illness till the complications occur showing by kidney or eye. According to the National Health Research [5], over 30% of Indonesian is suffering from hypertension and 76% of them do not realize. To minimize the risk of CVD, an early prevention and treatment are needed. Therefore, it is necessary to adopt a low-cost method to assess the 3Hs risk, especially in underdeveloped and developing countries.

The body parameters such as body mass index (BMI), waist circumference (WC), hip circumference (WHR), waist to height ratio (WHtR) are the most practical and effective indicators for assessment of obesity, which is accessible and noninvasive [3]. Literature shows a positive correlation between WC and WHR and the amount of visceral fat, and combined with them can predict CVD effectively [6]. Young et al. [7] validate the predictive ability of WC, WHR and BMI for hypertension based on 722 Chinese adults. Akdag determined that the risk factors for hypertension are WHR, gender, triglycerides (TC) and family history of hypertension [8]. Fava C validates that the addition of additional variables is more predictive of the hypertensive model [9]. Sandi G uses C4.5 to predict the health risk for the treatment of hypertension [10]. Tayefi M et al. apply to CART of decision tree with multi types of data in prediction of hypertension [11]. Yanrui S. proposes adaptive prediction algorithm which established a blood glucose prediction model based on glucose data, exercise and dietary [12]. However, it is valuable only if the user has a large amount of monitoring data.

Since blood fat detection is not as convenient as blood pressure and glucose, there is no prediction model for blood fat across the world. However, the relationship between blood fat and CVD is even closer than that of high blood pressure and glucose, so the prediction of blood fat is of great significance.

With the powerful ability to extract complex patterns and accurate assessment, machine learning has been widely used in the medical arena such as prediction of obesity, classification of prostate cancer etc. The Microsoft Machine Learning Summit presented new machine-based health science applications in 2013. Classification and regression tree (CART) has a special significance for health research, owing to it can find out a best combination of variables.

This article mainly contrast the Harvard Cancer Risk Index (HCRI) and CART in the 3Hs prediction. We will analyze that which variables or the combinations of them can assess the risk of 3Hs better so that patients have a choice to take an early treatment, which can prevent the occurrence of 3Hs or the complications such as CVD effectively [13].

2 Data

2.1 Data Set

The data are collected from a number of chronic disease surveillance sites in 10 cities of China (Beijing, Shanghai, Wuhan, etc.) from August 2013 to June 2015.

Screening of obese body parameters, including Age, BMI, WC, HC, WHtR, WHR, etc., along with features for 3Hs family history, history of illness, smoking, drinking, dietary, exercise, pressure from data sets. There are 6071 samples in total, age from 19 to 86 years old.

The statistical analysis of features shows in Table 1.

Table 1. Statistics of the 3Hs patient

It is obvious that the difference between both genders of feature is large, thus we analyze separately. In order to take full use of data resources and make the data balanced, the final data set consist of 1400 males (including 643 hypertensive patients while 757 not) and 800 females (including 421 hypertensive patients while 379 not) for high blood pressure study; 2800 males (including 1421 hyperlipidemia patients while 1379 not) and 2000 females (including 945 hyperlipidemia patients while 1055 not) for high blood fatness study; 1400 males (including 682 hyperglycemia patients while 718 not) and 800 females (including 367 hyperglycemia patients while 433 not) for high blood glucose study.

Each data set is randomly divided into ten groups containing one validation set and nine training set. Then we evaluate the model by averaging the accuracy, specificity, etc.

2.2 Data Analysis

The single factor analysis results of features are as follow:

For hypertension, the three most relevant factors are history of hypertension, family history of hypertension and gender (F-test = 615.232, 51.245 and 50.156 respectively). In addition, age, WC and HC are strong correlation (F-test = 6.215, 5.667 and 2.667 respectively). Fully illustrates the relationship between obesity and high blood pressure is very close. This demonstrates that obesity has a high correlation with hypertension.

For hyperlipidemia, the three most relevant factors are gender, age and WC (F-test = 192.065, 9.578 and 9.409 respectively) in the absence of fat-related medical history data. This demonstrates that obesity has a high correlation with hyperlipidemia. Besides, the F-test value of systolic blood pressure and HC is 6.026 and 5.789 respectively which means there is a relationship between hyperlipidemia and hypertension.

For hyperglycemia, the three most relevant factors are the history of diabetes, gender and family history of diabetes, and gender is more prominent than family history of diabetes. Compared with hypertension and hyperlipidemia, the F-test values of WHR, WHtR, WC and BMI are 16.774, 11.412, 9.667 and 9.435 respectively indicating that the relationship between obesity and hyperglycemia is well-connected. In addition, DBP, SBP and TC also have a significant correlation with hyperglycemia, which shows that the relationship among 3Hs is extremely close.

In order to show the relationship between each feature and the 3Hs more intuitively and clearly, Receiver Operational Characteristic (ROC) curve of male and female were drawn respectively by SPSS, with the body measurement parameters as test variables, and diagnostic as state variable. The area under the curve (AUC) was used to explore the predictive value of each feature for 3Hs.

Compared to the WHtR, WC, WHR of male, the WHtR and WC of female show a better prediction according to the hypertension ROC curve. From hyperlipidemia ROC curve, the feature shows similar predictive value for both male and female. The top three most valuable variables are WC, WHR and WHtR. From the AUC of hyperglycemia ROC curve, we can see that the feature of female show a better prediction than male, and WHtR ranks first (Figs. 1, 2 and 3).

Fig. 1.
figure 1

ROC curve and AUC of the variables in hypertension

Fig. 2.
figure 2

ROC curve and AUC of the variables in hyperlipidemia

Fig. 3.
figure 3

ROC curve and AUC of the variables in hyperlipidemia

3 Assessment Methods

3.1 HCRI - Logistic Model

The Harvard Cancer Risk Index (HCRI) is presented by Professor Harvard for 10 years of accumulated cancer-related data [14]:

$$ RR = \frac{{\prod\limits_{i = 1}^{N} {RI_{i} } }}{{\prod\limits_{j = 1}^{N} {(P_{j} \cdot RC_{j} + 1 - P_{j} )} }} $$
(1)

Where N is the number of risk factors, RR is the relative risk of the predicted sick individual compared with the general of same gender age group, and RC is the relative risk of a certain risk factor. RIi refers to the relative risk of sick individual, if one has a certain risk factor, RI = RC, otherwise RI = 1.0. And P j is the ratio of those who has a risk factor in the same gender and age [15]:

$$ P_{j} = \frac{{n_{j} }}{{m_{{}} }} $$
(2)

Where \( n_{j} \) is the number of samples with risk factor j in one gender and age group and m is the total.

The most critical parameter in the formula (1) is RC. In traditional methods, it is assigned by simple statistical analysis combined with expert advice. In order to describe the relationship between each feature and 3Hs more accurately, we determine RC by using logistic algorithm.

$$ x = \omega_{0} + \omega_{1} a_{1} + \omega_{2} a_{2} + \cdots + \omega_{k} a_{k} $$
(3)

Where x is the category, a is the attribute and w is the weight. By training the data, the corresponding weights can be calculated, and the attribute values of each training sample can be marked.

In order to solve the problem that the subordinate relationship value range is not always between 0 and 1 and independence, take the logarithm of the ratio of p and 1-p, where p is the probability:

$$ \ln \frac{p}{1 - p} = \log it\left( p \right) = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k} + \varepsilon $$
(4)
$$ \begin{aligned} \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{p} & = \frac{{\exp \left( {\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k} } \right)}}{{1 + \exp \left( {\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k} } \right)}} \\ {\kern 1pt} & = \frac{1}{{1 + \exp [ - \left( {\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k} } \right)]}} \\ \end{aligned} $$
(5)

Statistical variables (P < 0.05) were obtained, while insignificant attributes were screened out (P > = 0.05). Through logistic regression training, the regression coefficient can be get. And OR = exp (β) is the relative risk (it is a risk factor when OR > 1 while protective factor when OR < 1).

This article uses the Logistic classifier from the tree package in scikit-learn to tune and train the model and for make 10-fold cross-validation by using Kfold package. RC can be get (Fig. 4):

Fig. 4.
figure 4

The Calculation Process of RC

3.2 Logistic – Cart Model

Different from the traditional statistics, the CART algorithm is given in the form of a binary tree, which is easy to understand and has a more accurate prediction criterion. And the more complex the data, the more variables, the more significant the superiority of the algorithm. It consists of two parts:

Construction.

First, determine the condition attributes and decision attributes in the target data set, and then determine the split criteria. According to the value of the splitting function F i , the largest corresponding attribute A i is selected as the splitting attribute.

$$ F_{i} = \prod {P_{ij} } \sum {\left( {1 - 2 \times \hbox{min} \left\{ {k = 1 \cup k = 2\left| {P_{ijk} } \right.} \right\}} \right)} $$
(6)

Where P ij is the quantity ratio of the value j whose property is i in data set, P ijk is the quantity ratio of the value j whose property is i and which belongs to decision attribute k in data set.

After determining the best split attribute, we need to select the best split value:

$$ \Phi \left( {{s \mathord{\left/ {\vphantom {s t}} \right. \kern-0pt} t}} \right) = 2 \times P_{L} P_{R} \sum {\left| {P\left( {C_{j} \left| {{\kern 1pt} {\kern 1pt} t_{L} } \right.} \right) - P\left( {C_{j} \left| {{\kern 1pt} {\kern 1pt} t_{R} } \right.} \right)} \right|} $$
(7)

Where P L and P R are the ratio of the number of samples in the left and right sub tree to the total respectively:

$$ P_{L} \cup P_{R} = \frac{{m_{L\left| R \right.} }}{n} $$
(8)

Where is the number of samples in the left or right sub tree and n is the total. P(C j |t L ) and P(C j |t R ) refer to the probability of belonging to the Cj in the left and right sub tree respectively:

$$ P\left( {C_{j} \left| {{\kern 1pt} {\kern 1pt} t_{L} } \right.} \right) \cup P\left( {C_{j} \left| {{\kern 1pt} {\kern 1pt} t_{R} } \right.} \right) = \frac{{n_{j} }}{N} $$
(9)

Where n j is the number of samples belonging to class j in the left or right sub tree, and N is the total number at the target node.

After selecting the best split attribute and the best split attribute value, the data set of the current node is divided into W ijL and W ijR , and continue to split until all the nodes belong to the same class or the attribute set is empty.

Pruning.

Since the generated decision tree is likely over-fitting, it is necessary to prune the decision tree to optimize:

$$ R_{\alpha } \left( T \right) = R\left( T \right) + \alpha \left| {\tilde{T}} \right| $$
(10)

Where R α (T) represents the cost complexity of T, R(T) is misclassification loss, | \( \tilde{T} \) | is the number of nodes, and α is the complexity coefficient. During the increase of α from 0, always choose the sub tree with the smallest value of R α (T), and make it a node, finally we get the best decision tree. Textual data are pre-processed and tagged. Obesity-related variables A 1 A 2 …A m are input into the decision tree model. The output value is entered into the logistic regression model as a new variable B 1 along with the family history, personal history, and drinking habits (B 1 B 2 …B n ) with a threshold of 0.5 (Fig. 5).

Fig. 5.
figure 5

Decision tree and logistics regression assessment model

3.3 Assessment Indicators

After training, the article evaluates by accuracy, specificity and sensitivity (Fig. 6 and Table 2).

Fig. 6.
figure 6

Decision-making process

Table 2. Risk grading
$$ Accuracy = {{\left( {a + b} \right)} \mathord{\left/ {\vphantom {{\left( {a + b} \right)} {\left( {c + d} \right)}}} \right. \kern-0pt} {\left( {c + d} \right)}} $$
(11)
$$ Sensitivity = a/c $$
(12)
$$ Specificity = b/d $$
(13)

Where a and b are the number of patients and the normal which is determined correctly, while c and d are the total number of patients and the normal respectively.

3.4 Result

The result of the model 1 is in Table 3, the accuracy, specificity and sensitivity are taken as the mean of 10-fold cross-validation.

Table 3. Risk grading

From Table 4, the prediction results of individual variables are not ideal, and the accuracy, specificity and sensitivity are greatly improved after synthesis.

Table 4. Decision

The only WHtR is the ideal prediction parameter for both genders whose accuracy, specificity and sensitivity are better than the other four variables. It shows that the only WHtR has high value in hypertension prediction. However, WHtR has a decrease of 3.43%, 2.52% and 1.19% in the accuracy of male, specificity and sensitivity respectively compared with female which corresponds with ROC curve. It is noticeable that the max sensitivity of the individual WC on the test set is 60.61%, and the accuracy is next only to WHtR. Therefore, WC also has a high value in the prediction of hypertension for female.

Considering the relative between 3Hs and the prediction of blood pressure, we add DBP and SBP in the prediction of blood fat and blood glucose. The only WHR and the only WHtR have better accuracy, specificity and sensitivity than other 4 variables in both male and female models. Secondly, the accuracy, specificity and sensitivity of male model 8 and female model 8 have been greatly improved which indicates that the interaction of variables can provide more information for the hyperlipidemia of prediction.

We can see that the prediction of female is significantly better than male which is match to the results derived from ROC curve. In the prediction model of individual variables, WHtR is obviously superior to other variables which fully proves the high value of WHtR in the prediction of female hyperglycemia. The prediction variable next to WHtR is BMI, WC, WHR whose accuracy, specificity and sensitivity are close to the optimal model 8 of male, which means that the obesity index is closely related to the hyperglycemia of female. BMI, WHR and blood pressure are relatively ideal prediction parameters for male, and the accuracy and the sensitivity of the prediction affected by the only blood pressure are the highest which are 71.01% and 85.72% respectively, which means that blood pressure and blood glucose have confidential relation. It can be seen from the table that the accuracy and the specificity of model 8, are higher 7.35% and 27.6% respectively than the maximum of 71.01% (model 7) and 61.49% (model 2) of the single variable model.

According to the predictions of the 3Hs models, WHtR is the most crucial factor, and the more the parameters, the more favorable the prediction results. After adding factors such as 3Hs family history, history of illness, smoking, drinking, dietary, exercise, pressure. The prediction results are shown in Table 5.

Table 5. Decision tree prediction results of 3Hs

4 Discussion

Literature assumes that 500 million adults worldwide are troubled with obesity including fifteen million who are overweight [2]. Obesity is a public health problem, so that its and relative diseases treatment expenditure can be considerable. The prevalence of this chronic non-communicable diseases have been ever-increasing both in developed developing countries. The prevalence of female in the United States was 35.5%, compared with 32.2% for male. And the prevalence of female in the United States was 16.9%, compared with 12.4% for male [16].

Considering certain limitations, it is the most practical and low-cost way to assess obesity by anthropometric variables. The report of WHO in 2008 noted that BMI, WC and WHR are well-connected related to cardiovascular risk, hypertension, hyperlipidemia and other health problems.

The research of this article is that the algorithm of machine learning can predict 3Hs instantly with corresponding accuracy by measuring certain body parameters. Although we can now easily measure blood pressure and various body indicators in hospitals, there are poor supplies in some remote places, such as some parts of Africa and poor villages in China. Therefore, the study of this article will become very practical, because we only need ruler-measured parameters of the body to predict 3Hs. In the present study, we use the classification regression tree model to study the 3Hs prediction model for body measurement parameters. We can see that body measurement parameters such as obesity are more valuable to female than male in prediction. In all body measurements, the waist circumference was the most valuable predictor of 3Hs which is suitable for both male and female.

5 Conclusion and Future Work

This article mainly uses machine learning algorithm of CART to realize the prediction of 3Hs comparing with the HCRI - Logistic model. The study provides a simple classification rule to determine obesity risk factors related to 3Hs, which is potential to help to make management solution of 3Hs.