Abstract
In this paper, we show how to select different feature subsets for different classes, i.e., class-dependent feature subsets, for biomedical data. A feature importance ranking measure, the class separability measure, is used to rank features for each class and obtain a class-dependent feature importance ranking. Several candidate feature subsets are then formed for each class, and an “optimal” one for each class is determined through a classifier, e.g., the support vector machine (SVM). Our method of class-dependent feature selection is applied to several biomedical data sets and compared with class-independent feature selection. The experimental results show that our approach to class-dependent feature selection is efficient in reducing feature dimension while producing satisfactory classification accuracy.
1 Introduction
When dealing with biomedical data, the data dimensionality, i.e., the number of input features, can be quite large. To reduce computational burden and noise, it is often desirable to reduce the data dimensionality. One effective approach is feature extraction, for example, principal component analysis (PCA) [10, 25] and singular value decomposition [22]. The features produced by feature extraction are obtained by certain transforms of the original features and are therefore different from the original features. In this paper, we focus on another approach, feature selection, which chooses a subset of input features from the entire input feature set.
Based on the measures used to find the best feature subset, feature selection methods are divided into two categories: filter approaches [19] and wrapper approaches [18]. The classical RELIEF algorithm [19] and its extended version RELIEFF [21] are examples of filter approaches. They assign a weight to each feature and then update the weight according to training instances. These weights represent the relevance of the features; all features are therefore ranked according to their weights, and those features with weights above a predefined threshold are selected. Wrapper approaches “wrap” feature selection around a classifier. Most wrapper approaches also use heuristic search techniques, such as sequential forward and backward search [35], hill climbing [2], and best-first search [18], to first search for possible feature subsets, then evaluate those feature subsets through the classifier, and finally determine an optimal subset in terms of classification accuracy. In this paper, we adopt a wrapper approach to select features for better accuracy.
Different groups of features may have different abilities in distinguishing different classes [1, 7, 8, 13, 14, 23, 24, 26, 28–30, 34, 37, 38]. We therefore choose different feature subsets for different classes, which is called class-dependent feature selection [29], as opposed to the usual class-independent feature selection [4–6, 11, 19, 36], which selects a common feature subset for all the classes in a given classification problem. Class-independent feature selection is in fact a special case of class-dependent feature selection: if the feature subsets selected for all classes happen to be the same, one obtains class-independent feature selection. The filter and wrapper approaches mentioned in [19, 21] belong to class-independent feature selection. For class-dependent feature selection, Baggenstoss [1] provided related theoretical analysis and applied it to some artificial data sets. Oh et al. [28, 29] proposed a filter approach to selecting class-dependent feature subsets for the CENPARMI handwritten numeral database. Their experimental results [28, 29] showed that the classification accuracies of class-dependent feature selection were better than those of class-independent feature selection. In this paper, we demonstrate a wrapper approach to selecting class-dependent features for biomedical data using the support vector machine (SVM) [31, 32] as the classifier.
This paper is organized as follows. In Sect. 2, we review the class separability measure (CSM) and introduce our approach to class-dependent feature selection. In Sect. 3, we provide experimental results of our method on two biomedical data sets from the UCI machine learning repository databases [27] and compare the results with those of class-independent feature selection. Finally, in Sect. 4, we present conclusions about the present work.
2 Methodology
2.1 The Class Separability Measure
Different versions of the class separability measure (CSM) have been used by many researchers. The class separability proposed by Oh et al. [28, 29] is represented by \(S(c_{i},c_{j},\varvec{x})\), where \(c_{i}\) and \(c_{j}\) represent class i and class j of the data set, respectively, and \(\varvec{x}\) is a training sample. Each feature’s class separability is calculated individually, e.g., \(S(c_{i},c_{j},x_{p})\) for feature p, and features are ranked according to their class separation values. Fu and Wang [11] defined another class separability measure to rank each feature’s classification capability. This CSM includes two distance elements: the within-class distance (the distance between patterns within each class) and the between-class distance (the distance between patterns among different classes), which are described in Eqs. 1 and 2. Following [11], we adopt the ratio of the within-class distance to the between-class distance to measure each feature’s classification capability. For the whole training data, the within-class distance \(S_{w}\) [11] is calculated as:
$$\begin{aligned} S_{w}=\sum _{c=1}^{C}P_{c}\sum _{j=1}^{n_{c}}\left[ (\varvec{x}_{cj}-\varvec{m}_{c})^{T}(\varvec{x}_{cj}-\varvec{m}_{c})\right] ^{1/2} \end{aligned}$$(1)
and the between-class distance \(S_{b}\) [11] is calculated as:
$$\begin{aligned} S_{b}=\sum _{c=1}^{C}P_{c}\left[ (\varvec{m}_{c}-\varvec{m})^{T}(\varvec{m}_{c}-\varvec{m})\right] ^{1/2} \end{aligned}$$(2)
Here C denotes the number of classes and \(P_{c}\) denotes the probability of class c. \(n_{c}\) refers to the number of samples in class c and \(\varvec{x}_{cj}\) refers to sample j in class c. \(\varvec{m}_{c}\) refers to the mean vector of class c and \(\varvec{m}\) refers to the mean vector of all the training samples. The smaller the ratio \(S_{w}/S_{b}\), the better the separability. To evaluate one feature’s classification capability, we calculate the ratio \(S_{w}/S_{b}\) with that feature removed, denoted as \(S_{w}^{'}/S_{b}^{'}\). The greater \(S_{w}^{'}/S_{b}^{'}\), the more the separability degrades without the feature, and hence the more important the removed feature is. We can therefore evaluate the importance level of the features according to the ratio [11], with one attribute deleted each time in turn.
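The leave-one-feature-out ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the within-class distance is the prior-weighted sum of Euclidean distances from each sample to its class mean, that the between-class distance is the prior-weighted sum of Euclidean distances from each class mean to the overall mean, and that \(P_{c}\) is estimated by the class frequency. The function names `class_separability` and `rank_features` and the toy data are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def class_separability(samples, labels, features):
    """Return (S_w, S_b) computed on the given feature indices only."""
    n = len(samples)
    proj = [[row[k] for k in features] for row in samples]
    overall_mean = [sum(col) / n for col in zip(*proj)]
    s_w, s_b = 0.0, 0.0
    for c in sorted(set(labels)):
        members = [p for p, y in zip(proj, labels) if y == c]
        prior = len(members) / n  # assumption: P_c estimated by class frequency
        class_mean = [sum(col) / len(members) for col in zip(*members)]
        s_w += prior * sum(euclidean(p, class_mean) for p in members)
        s_b += prior * euclidean(class_mean, overall_mean)
    return s_w, s_b


def rank_features(samples, labels):
    """Rank features, most important first, by S_w'/S_b' with each feature removed."""
    n_features = len(samples[0])
    scores = {}
    for k in range(n_features):
        kept = [d for d in range(n_features) if d != k]
        s_w, s_b = class_separability(samples, labels, kept)
        scores[k] = s_w / s_b if s_b > 0 else float("inf")
    # A larger ratio after removal means the removed feature mattered more.
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy data: feature 0 separates the two classes, feature 1 barely does.
samples = [[0.0, 1.0], [0.2, 3.0], [0.1, 2.0],
           [5.0, 1.5], [5.2, 3.5], [5.1, 2.5]]
labels = [0, 0, 0, 1, 1, 1]
ranking, scores = rank_features(samples, labels)
print(ranking, scores)
```

On this toy data, removing feature 0 leaves only the weakly separating feature 1, so the ratio rises sharply and feature 0 is ranked first.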
2.2 Our Approach to Class-Dependent Feature Selection
We describe our class-dependent feature selection method in three steps. In step one, we convert a C-class classification problem into C 2-class classification problems. Each problem has only two classes: the current class and a second class that merges all the other classes. In step two, for each 2-class problem, we adopt the ranking measure CSM to evaluate the importance of each feature. In step three, based on each class-dependent feature importance ranking list, we form different feature subsets for each class by sequentially adding one feature to the previous subset. Each feature subset is evaluated through an SVM, and the feature subset corresponding to the highest classification accuracy is our choice for this class.
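The three steps can be sketched as follows. This is a hedged illustration only: the evaluator is a leave-one-out nearest-centroid rule used as a lightweight stand-in for the SVM, the ranking for the current class is passed in as if step two had already produced it, and all function names and the toy data are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def one_vs_rest_labels(labels, target):
    """Step one: relabel a C-class problem as target (+1) versus all the rest (-1)."""
    return [1 if y == target else -1 for y in labels]


def nearest_centroid_loo_accuracy(samples, binary_labels, subset):
    """Stand-in evaluator (NOT an SVM): leave-one-out nearest-centroid accuracy.

    Assumes each side of the 2-class problem has at least two samples.
    """
    proj = [[row[k] for k in subset] for row in samples]
    correct = 0
    for i, (x, y) in enumerate(zip(proj, binary_labels)):
        centroids = {}
        for label in (1, -1):
            rest = [p for j, (p, t) in enumerate(zip(proj, binary_labels))
                    if j != i and t == label]
            centroids[label] = [sum(col) / len(rest) for col in zip(*rest)]
        predicted = min(centroids, key=lambda c: euclidean(x, centroids[c]))
        correct += predicted == y
    return correct / len(proj)


def select_subset(samples, binary_labels, ranking, evaluate):
    """Step three: grow nested subsets along the ranking, keep the most accurate."""
    best_subset, best_accuracy = [], -1.0
    for size in range(1, len(ranking) + 1):
        subset = ranking[:size]
        accuracy = evaluate(samples, binary_labels, subset)
        if accuracy > best_accuracy:  # strict '>' keeps the smallest subset on ties
            best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy


# Toy 3-class data; class "a" differs from the rest only in feature 0.
samples = [[0.0, 5.0, 1.0], [0.2, 1.0, 3.0], [0.1, 3.0, 2.0],
           [5.0, 4.8, 1.2], [5.2, 1.1, 2.9],
           [5.1, 3.1, 2.1], [4.9, 2.0, 0.5]]
labels = ["a", "a", "a", "b", "b", "c", "c"]
binary = one_vs_rest_labels(labels, "a")
ranking_for_a = [0, 2, 1]  # assumed output of step two for class "a"
subset, accuracy = select_subset(samples, binary, ranking_for_a,
                                 nearest_centroid_loo_accuracy)
print(subset, accuracy)
```

Passing the evaluator as an argument keeps the subset search independent of the classifier, so an SVM-based accuracy estimate could be substituted without changing `select_subset`.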
During the process, a feature mask is introduced to describe each feature’s state, i.e., kept or removed. The feature mask is a vector whose elements take only two values, ‘0’ and ‘1’, in which ‘0’ represents the absence of a particular feature and ‘1’ represents its presence. For example, for a data set with 5 features \(\{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\}\), if the optimal feature subset obtained is \(\{x_{1}, x_{3}, x_{5}\}\), the corresponding feature mask is the vector {1, 0, 1, 0, 1}.
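The five-feature example above can be reproduced with a short sketch. The helper names `subset_to_mask` and `apply_mask` and the sample pattern values are hypothetical, and the indices are zero-based, so \(x_{1}, x_{3}, x_{5}\) become 0, 2 and 4.

```python
def subset_to_mask(subset, n_features):
    """Build a 0/1 feature mask from a set of kept (zero-based) feature indices."""
    return [1 if k in subset else 0 for k in range(n_features)]


def apply_mask(pattern, mask):
    """Keep only the values whose mask entry is 1."""
    return [value for value, keep in zip(pattern, mask) if keep == 1]


mask = subset_to_mask({0, 2, 4}, 5)    # x1, x3, x5 kept
pattern = [0.7, 1.3, 2.1, 0.4, 9.9]    # made-up values for x1..x5
filtered = apply_mask(pattern, mask)
print(mask)      # [1, 0, 1, 0, 1]
print(filtered)  # [0.7, 2.1, 9.9]
```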
2.3 SVM with Class-Dependent Features
We first build several SVM models and then combine them to accommodate class-dependent feature subsets. Each model is a binary classifier and is specific to one class. In the following, we introduce the construction process.
1. The training process: In this process, we construct C SVM models from the training patterns, i.e., each class has its own model according to its specific feature subset. For example, model i is trained with all the training examples in class i having positive labels and all the examples in the other classes having negative labels. Specifically, all the training examples are filtered by the feature mask of class i before they are input for training. For instance, if the feature mask of class i contains \(n^{(i)}\) entries equal to ‘1’, all the training examples used to build model i will have \(n^{(i)}\) features as the input, and those features corresponding to ‘0’ in the feature mask are removed. The output can be either ‘+1’ or ‘−1’. If the input pattern \(\varvec{x}_{j}\) belongs to class i, we consider it a positive sample (‘+1’); otherwise, we consider it a negative sample (‘−1’). The ith SVM model solves the following problem [16]:
$$\begin{aligned} \begin{array}{l} \displaystyle \text {min}_{\omega ^{i},b^{i},\xi ^{i}} \frac{1}{2} \varvec{\omega ^{i}}^{T}\varvec{\omega ^{i}}+\varsigma ^{i}\sum _{j=1}^{l}\xi _{j}^{i}\\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \ge 1- \xi ^{i}_{j}, \text { if } y_{j}= i, \\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \le -1 + \xi ^{i}_{j}, \text { if } y_{j}\ne i,\\ \displaystyle \xi ^{i}_{j}\ge 0,\quad j=1,\ldots ,l \end{array} \end{aligned}$$(3)
where \(\phi \) is the mapping function, \(\varsigma ^{i}\) is the penalty parameter for class i, and \(\xi _{j}^{i}\) are the “slack variables” for class i. \(\varvec{x}_{j}\) is sample j among the l training samples. Minimizing \(\frac{1}{2}(\omega ^{i})^{T}\omega ^{i}\) means maximizing the margin between the two groups of data. \(\varsigma ^{i}\sum ^{l}_{j=1}\xi ^{i}_{j}\) is a penalty term used to reduce the number of training errors when the data are not linearly separable.
2. The testing process: After the class-dependent models are constructed, we use them to classify unlabeled patterns. As in the training process, each testing pattern is filtered with one class’s feature mask before it is input into the corresponding SVM model, i.e., the original attributes corresponding to ‘0’ in the feature mask are removed. Among the C outputs, the testing pattern \(\varvec{x}_{j}\) is assigned to the class with the largest output value:
$$\begin{aligned} \text {Class of } \varvec{x}_{j}\equiv \text {argmax}_{i=1,2,\ldots ,C}(\varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i}) \end{aligned}$$(4)
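The masking, the slack variables of Eq. 3, and the decision rule of Eq. 4 can be illustrated with a small sketch. It is not a trained SVM: the weights, biases and masks below are made-up values, and the mapping \(\phi \) is assumed to be the identity (a linear kernel) so that the decision value can be written out directly. The names `decision_value`, `slack` and `classify` are hypothetical.

```python
def apply_mask(pattern, mask):
    """Keep only the values whose mask entry is 1."""
    return [value for value, keep in zip(pattern, mask) if keep == 1]


def decision_value(model, pattern):
    """w^T x + b on the masked pattern (assumes phi is the identity)."""
    x = apply_mask(pattern, model["mask"])
    return sum(w * v for w, v in zip(model["weights"], x)) + model["bias"]


def slack(target, value):
    """Smallest slack satisfying Eq. 3 for one sample, with target +1 or -1."""
    return max(0.0, 1.0 - target * value)


def classify(models, pattern):
    """Eq. 4: choose the class whose model gives the largest decision value."""
    return max(models, key=lambda c: decision_value(models[c], pattern))


# Hypothetical per-class models, each with its own feature mask.
models = {
    1: {"mask": [1, 0, 0], "weights": [1.0], "bias": -2.0},
    2: {"mask": [0, 1, 1], "weights": [0.5, -1.0], "bias": 0.0},
    3: {"mask": [0, 0, 1], "weights": [2.0], "bias": -1.0},
}
pattern = [3.0, 1.0, 0.2]
predicted = classify(models, pattern)
print(predicted)
print(slack(+1, decision_value(models[1], pattern)))
```

Because each model sees a different number of inputs, the mask has to be stored with the model and applied again at test time. Comparing raw decision values across models follows Eq. 4 as stated.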
3 Experiments and Discussions
To examine whether class-dependent feature selection is more efficient than class-independent feature selection, we conduct experiments on two biomedical data sets from the UCI machine learning repository databases [27]. The two criteria used to compare the methods are the number of features deleted and the classification accuracy.
3.1 Experimental Data
The first data set is the Ecoli data set. It has 7 attributes and 8 classes (localization sites of the protein). The number of instances is 336. The second data set is the processed Cleveland data set. It concerns heart disease diagnosis and was collected at the Cleveland Clinic Foundation. There are originally 303 samples, 13 features and 5 classes. Because 6 samples have unknown feature values, we remove them from our experiment, leaving 297 samples.
3.2 Experiment and Results
Among the available SVM software packages, LIBSVM 3.1 [3] with the RBF kernel was chosen for our experiment. The accuracy is calculated with 10-fold cross-validation. In Table 1, the results for the Ecoli data set [15] show that very different numbers of features are deleted for different classes with class-dependent feature selection. The results on the Cleveland Heart Disease data (Table 2) [9] also show that different classes have very different feature subsets. Class 1 has few features removed, on average 1.9 (Table 3), whereas for classes 2, 3, 4 and 5 the number of features deleted over the 10 simulations is on average between 9 and 12.
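The 10-fold protocol can be sketched as below. The text does not say how the folds were shuffled or whether they were stratified, so this sketch assumes a plain shuffled split and reports the mean of the per-fold accuracies. `k_fold_indices`, `cross_validated_accuracy` and the `fit_and_score` callback are hypothetical names.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal, shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(n_samples, fit_and_score, k=10, seed=0):
    """Average the accuracy returned by fit_and_score(train, test) over the folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# With the 336 Ecoli instances, each fold holds out 33 or 34 samples.
for train, test in k_fold_indices(336, k=10):
    print(len(train), len(test))
```

In a wrapper setting, `fit_and_score` would train the per-class models on the `train` indices and return the accuracy on the `test` indices.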
In Table 3, we present classification accuracies under three conditions: without feature selection, with class-dependent feature selection, and with class-independent feature selection. The clearest improvement in classification accuracy is on the Cleveland data set. Compared with the accuracy on the data without feature selection, our method increases the accuracy by about 3 %. Compared with the class-independent method, our method increases the accuracy from 56.23 % to 58.61 %.
4 Conclusions
In this paper, we demonstrated an approach to class-dependent feature selection. We adopted the class separability measure [11] to evaluate feature importance, based on which an optimal feature subset was determined for each class through the SVM. The experimental results for two biomedical data sets [27] show that each class has a different feature subset, which includes representative features for distinguishing the current class from the other classes, and that the corresponding classifier can improve or at least maintain the classification accuracy using those class-dependent feature subsets.
References
Baggenstoss, P.M.: Class-specific features in classification. IEEE Trans. Signal Process. pp. 3428–3432 (1999)
Caruana, R.A., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann Publishers, New Brunswick, NJ (1994)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)
Chu, F., Wang, L.P.: Gene expression data analysis using support vector machines. In: Proceedings of the International Joint Conference on Neural Networks 2003, vols. 1–4, pp. 2268–2271 (2003)
Chu, F., Xie, W., Wang, L.P.: Gene selection and cancer classification using a fuzzy neural network. In: Proceedings of the North-American Fuzzy Information Processing Conference (NAFIPS 2004), vol. 2, pp. 555–559 (2004)
Chu, F., Wang, L.P.: Applications of support vector machines to cancer classification with microarray data. Int. J. Neural Syst. 15(6), 475–484 (2005)
Crawford, M.M., Kumar, S., Ricard, M.R., Gibeaut, J.C., Neuenschwander, A.: Fusion of airborne polarimetric and interferometric SAR for classification of coastal environments. IEEE Trans. Geosci. Remote Sens. 37, 1306–1315 (1999)
Desai, M., Shazeer, D.J.: Acoustic transient analysis using wavelet decomposition. In: IEEE Conference on Neural Networks for Ocean Engineering, pp. 29–40 (1991)
Detrano, R.: The Cleveland Heart Disease Data Set. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation (1988)
La Foresta, F., Morabito, F.C., Azzerboni, B., Ipsale, M.: PCA and ICA for the extraction of EEG components in cerebral death assessment. In: IJCNN 05. Proceedings of 2005 IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2532–2537 (2005)
Fu, X.J., Wang, L.P.: Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance. IEEE Trans. Syst. Man Cybern. B Cybern. 33(3), 399–400 (2003)
Fu, X.J., Wang, L.P.: A GA-based novel RBF classifier with class-dependent features. In: Proceedings of 2002 Congress on Evolutionary Computation, no. 2, pp. 1890–1894 (2002)
Fu, X.J., Wang, L.P.: Rule extraction from an RBF classifier based on class-dependent features. In: CEC2002: Proceedings of the 2002 Congress on Evolutionary Computation, vols. 1 and 2, pp. 1916–1921 (2002)
Fu, X.J., Wang, L.P.: A rule extraction system with class-dependent features. In: Ghosh, A., Jain, L.C. (eds.) Evolutionary Computing in Data Mining, pp. 79–99. Springer, Berlin (2005)
Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Intelligent Systems in Molecular Biology, pp. 109–115 (1996)
Hsu, C.-W., Lin, C.-J.: A Comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)
Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. National Taiwan University, Department of Computer Science and Information Engineering, Taipei, Taiwan (2003)
John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 367–370. AAAI Press, Portland (1994)
Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm. In: Proceedings of 10th National Conference on Artificial Intelligence, pp. 129–134. AAAI Press/MIT Press, Menlo Park, CA (1992)
Koller, D., Sahami, M.: Toward Optimal Feature Selection. In: Proceedings of the 13th International Conference on Machine Learning (ML), pp. 284–292, Bari, Italy (1996)
Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Proceeding of the European Conference on Machine Learning (ECML94), pp. 171–182. Springer-Verlag, Berlin, Heidelberg (1994)
Liu, B., Wan, C.R., Wang, L.P.: An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Trans. Nano Biosci. 5(2), 110–114 (2006)
Marchiori, E.: Class dependent feature weighting and K-nearest Neighbor classification. Patt. Recogn. Bioinform. LNCS 7986, 69–78 (2013)
Mohammadi, M., Raahemi, B., Akbari, A., Nassersharif, B.: New class-dependent feature transformation for intrusion detection systems. Secur. Commun. Netw. 5, 1296–1311 (2012)
Morabito, C.F.: Independent component analysis and feature extraction techniques for NDT data. Mater. Eval. 58(1), 85–92 (2000)
Musselman, M., Djurdjanovic, D.: Time-frequency distributions in the classification of epilepsy from EEG signals. Expert Syst. Appl. 39, 11413–11422 (2012)
Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
Oh, I.S., Lee, J.S., Suen, C.Y.: Using class separation for feature analysis and combination of class-dependent features. In: Fourteenth International Conference on Pattern Recognition, no. 1, pp. 453–455 (1998)
Oh, I.S., Lee, J.S., Suen, C.Y.: Analysis of class separation and combination of class-dependent features for handwriting recognition. IEEE Trans. Patt. Anal. Mach. Intell. 21, 1089–1094 (1999)
Tian, J., Li, M., Chen, F., Feng, N.: Learning subspace-based rbfnn using coevolutionary algorithm for complex classification tasks. IEEE Trans. Neural Netw. Learn. Sys. (2015)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Wang, L.P. (ed.): Support Vector Machines: Theory and Applications. Springer, New York (2005)
Wang, L.P., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE/ACM Trans. Bioinf. Comput. Biol. 4(1), 40–53 (2007)
Wang, L.P., Zhou, N., Chu, F.: A general wrapper approach to selection of class-dependent features. IEEE Trans. Neural Netw. 19(7), 1267–1278 (2008)
Wang, L.P., Fu, X.J.: Data Mining with Computational Intelligence. Springer, Berlin (2005)
Zhou, N., Wang, L.P.: Effective selection of informative SNPs and classification on the HapMap genotype data. BMC Bioinf. 8, 484 (2007)
Zhou, N., Wang, L.P.: Class-dependent feature selection for face recognition. In: Advances in Neuro-Information Processing, Part II, vol. 5507, pp. 551–558 (2009). Proceedings of 15th International Conference on Neural Information Processing, ICONIP 2008, Auckland, New Zealand, 2008
Zhou, W.G., Dickson, J.: A novel class dependent feature selection method for cancer biomarker discovery. Comput. Biol. Med. 47, 66–75 (2014)
© 2016 Springer International Publishing Switzerland
Zhou, N., Wang, L. (2016). Processing Bio-medical Data with Class-Dependent Feature Selection. In: Bassis, S., Esposito, A., Morabito, F., Pasero, E. (eds) Advances in Neural Networks. WIRN 2015. Smart Innovation, Systems and Technologies, vol 54. Springer, Cham. https://doi.org/10.1007/978-3-319-33747-0_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-33746-3
Online ISBN: 978-3-319-33747-0