Elsevier

Applied Soft Computing

Volume 75, February 2019, Pages 323-332
GA-SVM based feature selection and parameter optimization in hospitalization expense modeling

https://doi.org/10.1016/j.asoc.2018.11.001

Highlights

  • K-means clustering is used to obtain two category labels: low expense and high expense.

  • A weighted combination of classification accuracy and feature number is taken as the fitness function.

  • The penalty factor c, the kernel function parameter γ and the feature mask are used to construct the chromosome.

  • A GA-SVM based method for feature selection and parameter optimization in hospitalization expense modeling is proposed.

Abstract

Feature selection and parameter optimization are two important ways to improve classifier performance. A novel approach based on the genetic algorithm (GA) for feature selection and parameter optimization of the support vector machine (SVM) is proposed in order to improve the prediction accuracy of a hospitalization expense model. First, the hospitalization expense data are preprocessed, including data cleaning, discretization and normalization. Second, k-means clustering is used to obtain two category labels. Third, the penalty factor c, the kernel function parameter γ and the feature mask are used to construct the chromosome. Fourth, a weighted combination of classification accuracy and feature number is taken as the fitness function, and GA is used to optimize the SVM parameters and simultaneously select the optimal feature subset. Finally, parameter optimization alone is performed using GA and particle swarm optimization (PSO), and the resulting optimization performance is compared with that of GA-PCA and PSO-PCA. Experimental results show that the proposed algorithm can quickly obtain suitable feature subsets and SVM parameters, thereby achieving a better classification result.
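
The labeling step can be sketched as a two-cluster k-means over a single expense column. This is a minimal sketch, assuming the clustering is run on total expense alone (the abstract does not say which variables are clustered); the function name `two_means_labels` and the expense figures are made up for illustration.

```python
def two_means_labels(values, max_iter=100):
    """Cluster 1-D values into two groups; label 0 = low, label 1 = high."""
    lo, hi = min(values), max(values)  # start the two centers at the extremes
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins the nearer center.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        # Update step: each center moves to the mean of its members.
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if new_lo == lo and new_hi == hi:
            break  # centers stopped moving, so the clustering has converged
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


# Hypothetical expense figures, not taken from the paper's data set.
expenses = [8200.0, 9100.0, 10400.0, 31000.0, 35500.0, 9800.0]
labels, centers = two_means_labels(expenses)
```

The resulting 0/1 labels would then serve as the class targets for the SVM.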

Introduction

With economic development and accelerating urbanization, residents' medical expenses are increasing quickly, and hospitalization expenses make up the main part of them. Controlling and predicting patients' hospitalization expenses is therefore very important. In recent years, hospitalization expenses have been studied from two angles: the analysis of impact factors [1], [2], [3], and the search for a suitable model to forecast hospitalization expenses [4], [5]. In general, two kinds of methods are used for this work: data mining and statistics. In the statistical field, multiple linear regression [6] and the Cox proportional hazards model [7] are used. The requirements of statistical methods are usually strict, including sufficiently large samples, normally distributed data and linear relationships. Hospitalization expense samples, however, are high dimensional, complex and nonlinear, and the variables are usually redundant, which can significantly affect the accuracy and stability of the model. Traditional statistical methods are therefore unsatisfactory, because their requirements are difficult to meet in applications. The neural network (NN) [8] is an information processing paradigm inspired by the way biological nervous systems process information, but it easily falls into local minima and converges slowly. The support vector machine (SVM) is a classification method based on the principle of structural risk minimization [9], [10]. It effectively avoids local optima and has unique advantages in dealing with complex problems such as limited samples and high dimensional, nonlinear data. The performance of SVM depends strongly on its kernel parameters and penalty factor, so selecting appropriate parameters is the key to improving classification accuracy. Many parameter optimization methods are currently available. For example, Sun [11] used the grid search method to optimize SVM parameters and improve the classification accuracy for bearing faults. Wang [12] and P.J. [13] used the particle swarm optimization (PSO) algorithm to optimize SVM parameters, and their results confirm the feasibility and superiority of the optimization method. T. Santhanam [14] and Wu [15] used the genetic algorithm (GA) for SVM parameter optimization. These methods have been applied to SVM parameter optimization, but few reports are available on the analysis of hospitalization expenses.

Feature subset selection is another important factor affecting classifier performance, because the original features contain a large amount of redundant information that is not directly related to modeling, which increases the computational cost of modeling hospitalization expenses and reduces classification accuracy. Common methods such as principal component analysis (PCA) are based on linear combinations; they lose some important information in the original variables and cannot reveal nonlinear structure, which is why nonlinear dimensionality reduction methods have emerged. Li [16] used GA to reduce the redundancy of feature information in remote sensing image classification and improved the classification accuracy of the images. Liu [17] used GA to select a feature subset of network intrusion data, with a neural network as the evaluation model, and showed that the method can effectively reduce the modeling time of intrusion detection and improve the detection rate.

Most methods optimize either feature subset selection or SVM parameters individually, which greatly limits the classification potential of SVM. Optimizing feature subset selection and SVM parameters simultaneously has therefore recently become a hot topic [18]. Oliveira [19] and Zhao [20] used GA to optimize feature subset selection and SVM parameters at the same time, and their experimental results show that the method not only reduces time complexity but also improves classification accuracy. In this paper, a GA-based method is proposed to optimize feature subset selection and SVM parameters together. The method not only reduces the redundancy of the feature variables in hospitalization expense data but also improves the classification performance of SVM.

Gastric cancer is one of the most common gastrointestinal tract tumors in the People’s Republic of China. According to the 2014 World Cancer Report [21], it is one of the leading causes of cancer deaths in the world [22]. The morbidity of gastric cancer exhibits marked geographical variation, with high-risk areas in Japan, China, Eastern Europe and certain countries in Latin America. Although its incidence has declined, gastric cancer still represents a tremendous health care burden in China [23], which has the world’s largest number of new cases of and deaths from gastric cancer. No systematic national vital statistics exist in China, but a retrospective sampling survey on malignant tumors from 2004 to 2005 found that the mortality rate from gastric cancer ranked third in overall cancer mortality [24]. Notably, China alone accounts for 42% of all gastric cancer cases worldwide, at least in part because of its large population [25]. The highest rates were often found in economically underdeveloped rural areas of China, including Gansu, Henan, Hebei, Shanxi, and Shaanxi Provinces [26]. Ningxia, located in the west of China, is also a high-risk area for gastric cancer. The etiology of gastric cancer is not clear, and the disease relapses easily. The good news is that, thanks to advances in technology, research and science, gastric cancer has become a disease of long duration. At the same time, however, as a chronic and long-lasting disease, gastric cancer is costly and debilitating not only to individuals and families but also to medical insurance companies, the community and the nation. In addition, the diversification of diagnostic and therapeutic technologies, combined with commonly seen unnecessary tests, procedures and treatments, further aggravates the economic burden on patients. Therefore, it is of considerable importance to understand and model the structure of hospitalization expenses of patients with gastric cancer in order to make a reasonable forecast of the anticipated costs of hospitalization.

Section snippets

Algorithm and principle of support vector machine

SVM is a pattern recognition method developed from statistical learning theory and based on the structural risk minimization principle. While ensuring classification accuracy, SVM improves the generalization ability of the learning machine by maximizing the classification margin. The biggest advantage of SVM is that it overcomes over-learning and the high dimensionality problem, both of which lead to computational complexity and local extrema. A reliable classification model based on SVM
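
For reference, the standard textbook soft-margin formulation shows where the two tuned parameters enter. The snippet above is truncated before any equations appear, so the notation here (w, b, ξ_i, φ) and the choice of the radial basis function kernel are assumptions, not the paper's own formulas.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\qquad i=1,\dots,n,

K(x_i, x_j) \;=\; \phi(x_i)^{\top}\phi(x_j) \;=\; \exp\bigl(-\gamma\,\lVert x_i - x_j\rVert^{2}\bigr).
```

In this form a larger C penalizes margin violations more heavily, and a larger γ makes the kernel narrower, which is why the two values strongly affect classification accuracy.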

The algorithm of feature selection and parameter optimization based on GA-SVM

GA is a heuristic optimization algorithm that can be categorized as a global search algorithm. In this paper, GA is used to simultaneously optimize the SVM parameters and the feature selection for hospitalization expenses, so that appropriate features and parameters are used to build the hospitalization expense model. Chromosome design, fitness function, the progress of K-means is utilized to obtain two category labels (high cost and low cost). GA-SVM algorithm architecture will be
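
The chromosome and the fitness function can be sketched as follows. This is a hedged illustration only: the snippet does not show the paper's encoding or weights, so the real-valued `c` and `gamma` fields, the weights `w_acc` and `w_feat`, and the exact form of the score are assumptions. The accuracy argument would come from training and validating an SVM on the selected features.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chromosome:
    """One GA individual: two SVM parameters plus a 0/1 feature mask."""
    c: float          # penalty factor
    gamma: float      # kernel parameter
    mask: List[int]   # mask[j] == 1 keeps feature j, 0 drops it


def select_features(row: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Keep only the feature values whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]


def fitness(accuracy: float, mask: Sequence[int],
            w_acc: float = 0.8, w_feat: float = 0.2) -> float:
    """Weighted score that rewards accuracy and rewards using fewer features.

    The weights and the functional form are assumptions, not the paper's values.
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty feature subset cannot train a classifier
    return w_acc * accuracy + w_feat * (1.0 - n_selected / len(mask))


individual = Chromosome(c=12.5, gamma=0.3, mask=[1, 0, 1, 0])
row = [0.2, 0.9, 0.5, 0.1]                           # one hypothetical sample
kept = select_features(row, individual.mask)         # features 0 and 2 survive
score = fitness(accuracy=0.9, mask=individual.mask)  # accuracy is a placeholder
```

With these assumed weights, two individuals of equal accuracy are ranked by how few features they use, which is what pushes the search toward compact subsets.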

Data cleaning

Data in our study were collected from the medical records of patients with gastric cancer in a tertiary hospital in Yinchuan city from 2013 to 2014. The total number of cases was 1252. The quality of the target samples was ensured by data cleaning, in which missing values were filled (individual missing values were replaced with neighboring non-missing values), cases with incomplete information (such as unknown age, hospitalization date and payment information), duplicate cases and
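
The cleaning rules named in this snippet can be sketched in plain Python. The snippet is truncated, so this assumes that incomplete and duplicate cases were dropped; the field names in `REQUIRED_FIELDS`, both helper functions and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("age", "admission_date", "payment")  # hypothetical field names


def fill_from_neighbor(values):
    """Replace None with the previous non-missing value, else with the next one."""
    filled = list(values)
    last = None
    for i, v in enumerate(filled):          # forward pass
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward pass for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


def drop_incomplete_and_duplicates(records):
    """Drop cases missing a required field, then drop exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            continue
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            kept.append(dict(rec))
    return kept


records = [
    {"age": 61, "admission_date": "2013-05-02", "payment": "insurance"},
    {"age": 61, "admission_date": "2013-05-02", "payment": "insurance"},  # duplicate
    {"age": None, "admission_date": "2014-01-17", "payment": "self"},     # unknown age
    {"age": 48, "admission_date": "2014-03-09", "payment": "self"},
]
cleaned = drop_incomplete_and_duplicates(records)
stay_days = fill_from_neighbor([None, 12, None, 9])
```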

Experimental environment and parameter settings

In this paper, the population size is set to 20, the maximum number of iterations to 150, the crossover rate to 0.6 and the mutation rate to 0.02; the penalty factor C and the kernel parameter γ are in the range of 0 to 100. To improve the stability of the model, five-fold cross-validation experiments are performed. The data of 1243 cases are divided into 5 parts: each of the first 4 parts includes 248 cases, while the fifth part includes 251 cases. Each part is used as the test data
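
The settings and the fold sizes stated above can be written down as a short sketch. The dictionary keys and the helper names `fold_sizes` and `fold_indices` are illustrative, and the contiguous, unshuffled split is an assumption, since the snippet does not say how cases are assigned to parts.

```python
GA_SETTINGS = {
    "population_size": 20,
    "max_iterations": 150,
    "crossover_rate": 0.6,
    "mutation_rate": 0.02,
    "c_range": (0, 100),      # search range for the penalty factor C
    "gamma_range": (0, 100),  # search range for the kernel parameter
}


def fold_sizes(n_cases, k=5):
    """First k-1 folds get floor(n/k) cases; the last fold takes the remainder."""
    base = n_cases // k
    return [base] * (k - 1) + [n_cases - base * (k - 1)]


def fold_indices(n_cases, k=5):
    """Yield (train, test) index lists; each fold serves once as the test set."""
    start = 0
    for size in fold_sizes(n_cases, k):
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_cases))
        yield train, test
        start += size


sizes = fold_sizes(1243)            # four folds of 248 cases and one of 251
folds = list(fold_indices(1243))
```

In practice both C and γ must be strictly positive for an SVM, so an implementation would likely use a small positive lower bound rather than exactly 0.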

Conclusions

The data of hospitalization expenses are high dimensional, complex and nonlinear. Traditional modeling methods and linear dimensionality reduction have difficulty meeting the requirements of the model. In this paper, a method of feature selection and parameter optimization based on GA is proposed, which avoids the loss of information seen in traditional linear feature reduction methods and solves the problem of parameter setting at the same time. Experimental results show that the

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China under Grants 81160183, 61561040, 61471297 and 61771397, and by the Natural Science Foundation of Ningxia, China under Grant NZ14085.

References (31)

  • Sang Yu et al.

    An effective discretization method for disposing high-dimensional data

    Inform. Sci.

    (2014)

  • Cao Shu-zhen et al.

    A decision-tree-based analysis of the factors influencing single disease costs

  • Mirabzadeh A. et al.

    Cost prediction of antipsychotic medication of psychiatric disorder using artificial neural network model

    J. Res. Med. Sci.

    (2013)

  • Tang Zhentao et al.

    Multiple linear regression analysis of influencing factors of hospitalization cost in patients with cancer in Liaoning province

    China Cancer

    (2011)

  • Zhao Yan et al.

    Applied study on Cox regression model in hospitalization expenses control

    Modern Prev. Med.

    (2008)