Processing Bio-medical Data with Class-Dependent Feature Selection

Zhou, Nina; Wang, Lipo

doi:10.1007/978-3-319-33747-0_30

Conference paper
First Online: 19 June 2016

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 54)

Abstract

In this paper, we show how to select different feature subsets for different classes, i.e., class-dependent feature subsets, for biomedical data. A feature importance ranking measure, i.e., the class separability measure, is used to rank features for each class and obtain a class-dependent feature importance ranking. Several feature subsets are then formed for each class, and an “optimal” one for each class is determined through a classifier, e.g., the support vector machine (SVM). Our method of class-dependent feature selection is applied to several biomedical data sets and compared with class-independent feature selection. The experimental results show that our approach to class-dependent feature selection is efficient in reducing feature dimension and produces satisfactory classification accuracy.

1 Introduction

When dealing with biomedical data, the data dimensionality, i.e., the number of input features, can be quite large. To reduce computational burden and noise, it is often desirable to reduce the data dimensionality. One effective approach is feature extraction, for example, principal component analysis (PCA) [10, 25] and singular value decomposition [22]. Features obtained via feature extraction are computed by certain transforms of the original features and therefore differ from the original features. In this paper, we focus on another approach, i.e., feature selection, which chooses a subset of input features from the entire input feature set.

Based on the measures used to find the best feature subset, feature selection methods are divided into two categories: filter approaches [19] and wrapper approaches [18]. The classical RELIEF algorithm [19] and its extended version RELIEFF [21] are examples of filter approaches. They assign a weight to each feature and then update the weight according to training instances. These weights represent the relevance of features. All features are therefore ranked according to their weights, and those features with weights above a predefined threshold are selected. Wrapper approaches “wrap” feature selection around a classifier. Most wrapper approaches also utilize heuristic search techniques, such as sequential forward and backward search [35], hill climbing [2], and best-first search [18], to first search for possible feature subsets, then evaluate those feature subsets through the classifier, and finally determine an optimal subset in terms of classification accuracy. In this paper, we adopt a wrapper approach to select features for better accuracy.

Considering the possibility that different groups of features may have different abilities in distinguishing different classes [1, 7, 8, 13, 14, 23, 24, 26, 28–30, 34, 37, 38], we will choose different feature subsets for different classes, which is called class-dependent feature selection [29], as opposed to the usual class-independent feature selection [4–6, 11, 19, 36], which is to select a common feature subset for all the classes in a given classification problem. We note that class-independent feature selection is in fact a special case of class-dependent feature selection, that is, if all feature subsets in class-dependent feature selection for all classes happen to be the same, one obtains class-independent feature selection. The filter and wrapper approaches mentioned in [19, 21] belong to class-independent feature selection. For class-dependent feature selection, Baggenstoss [1] provided related theoretical analysis and utilized it on some artificial data sets. Oh et al. [28, 29] proposed a filter approach to selecting class-dependent feature subsets for the CENPARMI handwritten numerical database. The experimental results [28, 29] showed that classification accuracies of class-dependent feature selection were better compared to those of class-independent feature selection. In this paper, we will demonstrate a wrapper approach to selecting class-dependent features for biomedical data using the support vector machine (SVM) [31, 32] as the classifier.

This paper is organized as follows. In Sect. 2, we review the class separability measure (CSM) and introduce our approach to class-dependent feature selection. In Sect. 3, we provide experimental results of our method on two biomedical data sets from the UCI machine learning repository databases [27], and compare the results with those of class-independent feature selection. Finally, in Sect. 4, we present conclusions about the present work.

2 Methodology

2.1 The Class Separability Measure

Different versions of the class separability measure (CSM) have been used by many researchers. The class separability proposed by Oh et al. [28, 29] is represented by $S(c_{i},c_{j},\varvec{x})$, where $c_{i}$ and $c_{j}$ represent class i and class j of the data set, respectively, and $\varvec{x}$ is a training sample. Each feature’s class separability is calculated individually, e.g., $S(c_{i},c_{j},x_{p})$ for feature p, and features are ranked according to their class separation values. Fu and Wang [11] defined another class separability measure to rank each feature’s classification capability. This CSM includes two distance elements: the within-class distance (the distance between patterns within each class) and the between-class distance (the distance between patterns of different classes), which are described in Eqs. 1 and 2. Following [11], we adopt the ratio of the within-class distance to the between-class distance to measure each feature’s classification capability. For the whole training data, the within-class distance $S_{w}$ [11] is calculated as:

$$\begin{aligned} S_{w}=\sum _{c=1}^{C}P_{c}\sum _{j=1}^{n_{c}}(\varvec{x}_{cj}-\varvec{m}_{c})(\varvec{x}_{cj}-\varvec{m}_{c})^{T} \end{aligned}$$

(1)

and the between-class distance $S_{b}$ [11] is calculated as:

$$\begin{aligned} S_{b}=\sum _{c=1}^{C}P_{c}(\varvec{m}_{c}-\varvec{m})(\varvec{m}_{c}-\varvec{m})^{T} \end{aligned}$$

(2)

Here C denotes the number of classes and $P_{c}$ denotes the probability of class c. $n_{c}$ refers to the number of samples in class c and $\varvec{x}_{cj}$ refers to sample j in class c. $\varvec{m}_{c}$ refers to the mean vector of class c and $\varvec{m}$ refers to the mean vector of all the training samples. The smaller the ratio $S_{w}/S_{b}$, the better the separability. To evaluate one feature’s classification capability, we calculate the ratio $S_{w}/S_{b}$ with that feature removed, denoted as $S_{w}^{'}/S_{b}^{'}$. The greater $S_{w}^{'}/S_{b}^{'}$, the more important the removed feature is. Hence, we may evaluate the importance level of the features according to the ratio [11], deleting one attribute at a time in turn.
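The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes that $S_{w}$ and $S_{b}$ are evaluated as scalars (sums of squared distances, i.e., the samples are treated as row vectors in Eqs. 1 and 2), that $P_{c}$ is estimated by the class frequency, that there are at least two features, and that the class means differ so that $S_{b}$ is non-zero. The function names `csm_ratio` and `rank_features` and the toy data are hypothetical.

```python
def csm_ratio(X, y):
    """Return S_w / S_b for samples X (list of feature lists) and labels y.

    Assumption: S_w and S_b are scalar sums of squared distances.
    """
    n, d = len(X), len(X[0])
    overall_mean = [sum(row[k] for row in X) / n for k in range(d)]
    s_w = 0.0
    s_b = 0.0
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        prior = len(rows) / n  # P_c estimated as the class frequency
        class_mean = [sum(row[k] for row in rows) / len(rows) for k in range(d)]
        # Within-class term of Eq. 1 for class c
        s_w += prior * sum(
            (row[k] - class_mean[k]) ** 2 for row in rows for k in range(d)
        )
        # Between-class term of Eq. 2 for class c
        s_b += prior * sum(
            (class_mean[k] - overall_mean[k]) ** 2 for k in range(d)
        )
    return s_w / s_b


def rank_features(X, y):
    """Rank features from most to least important.

    Feature p is scored by S_w'/S_b' computed with feature p removed;
    a larger ratio means the removed feature was more important.
    """
    d = len(X[0])
    ratios = []
    for p in range(d):
        reduced = [[v for k, v in enumerate(row) if k != p] for row in X]
        ratios.append(csm_ratio(reduced, y))
    order = sorted(range(d), key=lambda p: ratios[p], reverse=True)
    return order, ratios


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 1.0], [0.2, 3.0], [10.0, 1.5], [10.2, 3.5]]
y = [0, 0, 1, 1]
order, ratios = rank_features(X, y)
print(order)  # feature indices, most important first
```

In this toy example, removing feature 0 leaves only the uninformative feature and the ratio becomes large, so feature 0 is ranked first.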

2.2 Our Approach to Class-Dependent Feature Selection

We describe our class-dependent feature selection method in three steps. In step one, we convert a C-class classification problem into C 2-class classification problems. Each problem has only two classes: the current class, and a second class that merges all the other classes. In step two, for each 2-class problem, we adopt the ranking measure CSM to evaluate the importance of each feature. In step three, based on each class-dependent feature importance ranking list, we form different feature subsets for each class by sequentially adding one feature to the previous subset. Each feature subset is evaluated through an SVM, and the feature subset corresponding to the highest classification accuracy is chosen for this class.

During the process, a feature mask is introduced to describe each feature’s state, i.e., kept or removed. The feature mask is a vector, each element of which takes one of two values, ‘0’ or ‘1’, where ‘0’ represents the absence of a particular feature and ‘1’ represents its presence. For example, considering a data set with 5 features $\{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\}$, if the optimal feature subset obtained is $\{x_{1}, x_{3}, x_{5}\}$, the corresponding feature mask is the vector {1, 0, 1, 0, 1}.
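The three steps and the feature mask can be illustrated with a short Python sketch. It is an assumption-laden outline rather than the authors' code: the helper names are hypothetical, features are indexed from 0, and the classifier is abstracted as a user-supplied `evaluate(mask)` callable that returns an accuracy (in the paper this role is played by an SVM with cross-validation). Ties are resolved in favor of the smaller subset, which is a design choice of this sketch.

```python
def one_vs_rest_labels(y, target):
    """Step one: relabel a C-class problem as 'target' (+1) versus the rest (-1)."""
    return [1 if label == target else -1 for label in y]


def subset_to_mask(subset, n_features):
    """Encode a set of 0-based feature indices as a 0/1 feature mask."""
    chosen = set(subset)
    return [1 if k in chosen else 0 for k in range(n_features)]


def apply_mask(row, mask):
    """Keep only the features whose mask entry is 1."""
    return [v for v, keep in zip(row, mask) if keep == 1]


def select_mask(ranking, n_features, evaluate):
    """Step three: grow subsets along the ranking and keep the best-scoring one.

    ranking  -- feature indices for one class, most important first
    evaluate -- hypothetical callable mapping a mask to a classification accuracy
    """
    best_mask, best_score = None, float("-inf")
    for size in range(1, len(ranking) + 1):
        mask = subset_to_mask(ranking[:size], n_features)
        score = evaluate(mask)
        if score > best_score:  # strict '>' keeps the smaller subset on ties
            best_mask, best_score = mask, score
    return best_mask, best_score


# The example from the text: {x1, x3, x5} out of 5 features.
mask = subset_to_mask([0, 2, 4], 5)
print(mask)  # [1, 0, 1, 0, 1]

# Stand-in scorer (made-up accuracies keyed by subset size), for illustration only.
toy_scores = {1: 0.7, 2: 0.9, 3: 0.9, 4: 0.8, 5: 0.6}
best_mask, best_score = select_mask(
    [2, 0, 4, 1, 3], 5, lambda m: toy_scores[sum(m)]
)
```

With the made-up scores above, the two-feature subset is selected because it reaches the highest score with the fewest features.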

2.3 SVM with Class-Dependent Features

We first build several SVM models and then combine them to accommodate class-dependent feature subsets. Each model is a binary classifier and is specific to one class. In the following, we introduce the construction process.

1. The training process: We construct C SVM models from the training patterns, so that each class has its own model built on its own feature subset. Model i is trained with all the training examples in class i labeled positive and all the examples in the other classes labeled negative. All the training examples are filtered by the feature mask of class i before they are input for training. For instance, if the feature mask of class i contains $n^{(i)}$ entries equal to ‘1’, every training example used for model i has $n^{(i)}$ input features, and the features corresponding to ‘0’ in the feature mask are removed. The target output is either ‘+1’ or ‘−1’: if the input pattern $\varvec{x}_{j}$ belongs to class i, it is a positive sample (‘+1’); otherwise, it is a negative sample (‘−1’). The ith SVM model solves the following problem [16]:
$$\begin{aligned} \begin{array}{l} \displaystyle \text {min}_{\omega ^{i},b^{i},\xi ^{i}} \frac{1}{2} \varvec{\omega ^{i}}^{T}\varvec{\omega ^{i}}+\varsigma ^{i}\sum _{j=1}^{l}\xi _{j}^{i}\\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \ge 1- \xi ^{i}_{j}, \text { if } y_{j}= i, \\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \le -1 + \xi ^{i}_{j}, \text { if } y_{j}\ne i,\\ \displaystyle \xi ^{i}_{j}\ge 0,\quad j=1,\ldots ,l \end{array} \end{aligned}$$
(3)
where $\phi $ is the mapping function, $\varsigma ^{i}$ is the penalty parameter for class i, and $\xi _{j}^{i}$ are the “slack variables” for class i. $\varvec{x}_{j}$ is the jth of the l training samples. Minimizing $\frac{1}{2}\varvec{\omega ^{i}}^{T}\varvec{\omega ^{i}}$ means maximizing the margin between the two groups of data. $\varsigma ^{i}\sum ^{l}_{j=1}\xi ^{i}_{j}$ is a penalty term used to reduce the number of training errors when the data are not linearly separable.
2. The testing process: After the class-dependent models are constructed, we use them to classify unlabeled patterns. As in the training process, each testing pattern is filtered with one class’s feature mask before it is input into the corresponding SVM model, i.e., the original attributes corresponding to ‘0’ in the feature mask are removed. Among the C outputs, the testing pattern $\varvec{x}_{j}$ is assigned to the class with the largest output value:
$$\begin{aligned} \text {Class of } \varvec{x}_{j}\equiv \text {argmax}_{i=1,2,\ldots ,C}(\varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i}) \end{aligned}$$
(4)
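The decision rule of Eq. 4 can be sketched as follows. This is only an illustration under stated assumptions: the C trained models are represented by hypothetical decision functions, here simple linear functions with hand-picked weights (i.e., $\phi$ taken as the identity) rather than trained RBF-kernel SVMs, and classes are indexed from 0. The names `make_linear` and `predict_class` are not from the paper.

```python
def make_linear(weights, bias):
    """Return a hypothetical decision function f(z) = w . z + b."""
    def decision(z):
        return sum(w * v for w, v in zip(weights, z)) + bias
    return decision


def predict_class(x, masks, decision_functions):
    """Filter x with each class's mask, evaluate that class's model,
    and return the (0-based) class index with the largest output (Eq. 4)."""
    outputs = []
    for mask, decision in zip(masks, decision_functions):
        filtered = [v for v, keep in zip(x, mask) if keep == 1]
        outputs.append(decision(filtered))
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Two classes, three original features; each class keeps a different subset.
masks = [[1, 0, 1], [0, 1, 1]]
models = [make_linear([1.0, -0.5], 0.1), make_linear([0.8, 0.2], -0.3)]

print(predict_class([2.0, 0.5, 1.0], masks, models))  # class 0 has the larger output
print(predict_class([0.0, 3.0, 1.0], masks, models))  # class 1 has the larger output
```

Because each model receives a differently filtered version of the same pattern, the input dimensionality may differ from model to model, and only the scalar outputs are compared.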

3 Experiments and Discussions

To examine whether class-dependent feature selection is more efficient than class-independent feature selection, we conduct experiments on two biomedical data sets from the UCI machine learning repository databases [27]. The two criteria used to compare the methods are the number of features deleted and the classification accuracy.

3.1 Experimental Data

The first data set is the Ecoli data set. It has 7 attributes, 8 classes (localization sites of the protein), and 336 instances. The second data set is the processed Cleveland data set. It mainly concerns heart disease diagnosis and was collected from the Cleveland Clinic Foundation. There are originally 303 samples, 13 features and 5 classes. Because 6 samples have unknown feature values, we remove these 6 samples from our experiment, leaving 297 samples.

3.2 Experiment and Results

Among the various SVM software packages, LIBSVM 3.1 [3] with the RBF kernel was chosen for our experiment. The 10-fold cross-validation method is used to calculate the accuracy. In Table 1, the results for the Ecoli data set [15] show that very different numbers of features are deleted for different classes with class-dependent feature selection. The results on the Cleveland Heart Disease data [9] (Table 2) also show that different classes have very different feature subsets. Class 1 has few features removed, i.e., on average 1.9 (Table 3), while for classes 2, 3, 4 and 5, the number of features deleted in the 10 simulations is on average within the range of [9, 12].

Table 1 Feature selection results for the Ecoli data set

Table 2 Feature selection results for the Cleveland heart disease data set

Table 3 Classification accuracy comparisons among different feature selection methods for the two biomedical data sets

In Table 3, we present classification accuracies for three conditions, i.e., without feature selection, with class-dependent feature selection, and with class-independent feature selection. The most obvious improvement in classification accuracy is for the Cleveland data set. Compared with the accuracy on the data without feature selection, our method increases the accuracy by about 3%. Compared with the class-independent method, our method increases the accuracy from 56.23% to 58.61%.

4 Conclusions

In this paper, we demonstrated an approach to class-dependent feature selection. We adopted class separability measure [11] to evaluate feature importance, based on which an optimal feature subset was determined for each class through the SVM. The experimental results for two biomedical data sets [27] show that each class has a different feature subset which includes representative features for classifying the current class from the other classes, and the corresponding classifier can improve or at least maintain the classification accuracy using those class-dependent feature subsets.

References

1. Baggenstoss, P.M.: Class-specific features in classification. IEEE Trans. Signal Process. pp. 3428–3432 (1999)
2. Caruana, R.A., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann Publishers, New Brunswick, NJ (1994)
3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)
4. Chu, F., Wang, L.P.: Gene expression data analysis using support vector machines. In: Proceedings of the International Joint Conference on Neural Networks 2003, vols. 1–4, pp. 2268–2271 (2003)
5. Chu, F., Xie, W., Wang, L.P.: Gene selection and cancer classification using a fuzzy neural network. In: Proceedings of the North-American Fuzzy Information Processing Conference (NAFIPS 2004), vol. 2, pp. 555–559 (2004)
6. Chu, F., Wang, L.P.: Applications of support vector machines to cancer classification with microarray data. Int. J. Neural Syst. 15(6), 475–484 (2005)
7. Crawford, M.M., Kumar, S., Ricard, M.R., Gibeaut, J.C., Neuenschwander, A.: Fusion of airborne polarimetric and interferometric SAR for classification of coastal environments. IEEE Trans. Geosci. Remote Sens. 37, 1306–1315 (1999)
8. Desai, M., Shazeer, D.J.: Acoustic transient analysis using wavelet decomposition. In: IEEE Conference on Neural Networks for Ocean Engineering, pp. 29–40 (1991)
9. Detrano, R.: The Cleveland Heart Disease Data Set. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation (1988)
10. La Foresta, F., Morabito, F.C., Azzerboni, B., Ipsale, M.: PCA and ICA for the extraction of EEG components in cerebral death assessment. In: IJCNN 05. Proceedings of 2005 IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2532–2537 (2005)
11. Fu, X.J., Wang, L.P.: Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance. IEEE Trans. Syst. Man Cybern. B Cybern. 33(3), 399–400 (2003)
12. Fu, X.J., Wang, L.P.: A GA-based novel RBF classifier with class-dependent features. In: Proceedings of 2002 Congress on Evolutionary Computation, no. 2, pp. 1890–1894 (2002)
13. Fu, X.J., Wang, L.P.: Rule extraction from an RBF classifier based on class-dependent features. In: CEC2002: Proceedings of the 2002 Congress on Evolutionary Computation, vols. 1 and 2, pp. 1916–1921 (2002)
14. Fu, X.J., Wang, L.P.: A rule extraction system with class-dependent features. In: Ghosh, A., Jain, L.C. (eds.) Evolutionary Computing in Data Mining, pp. 79–99. Springer, Berlin (2005)
15. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Intelligent Systems in Molecular Biology, pp. 109–115 (1996)
16. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)
17. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. National Taiwan University, Department of Computer Science and Information Engineering, Taipei, Taiwan (2003)
18. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 367–370. AAAI Press, Portland (1994)
19. Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm. In: Proceedings of 10th National Conference on Artificial Intelligence, pp. 129–134. AAAI Press/MIT Press, Park, CA (1992)
20. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning (ML), pp. 284–292, Bari, Italy (1996)
21. Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Proceedings of the European Conference on Machine Learning (ECML94), pp. 171–182. Springer-Verlag, Berlin, Heidelberg (1994)
22. Liu, B., Wan, C.R., Wang, L.P.: An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Trans. Nano Biosci. 5(2), 110–114 (2006)
23. Marchiori, E.: Class dependent feature weighting and K-nearest neighbor classification. Patt. Recogn. Bioinform. LNCS 7986, 69–78 (2013)
24. Mohammadi, M., Raahemi, B., Akbari, A., Nassersharif, B.: New class-dependent feature transformation for intrusion detection systems. Secur. Commun. Netw. 5, 1296–1311 (2012)
25. Morabito, C.F.: Independent component analysis and feature extraction techniques for NDT data. Mater. Eval. 58(1), 85–92 (2000)
26. Musselman, M., Djurdjanovic, D.: Time-frequency distributions in the classification of epilepsy from EEG signals. Expert Syst. Appl. 39, 11413–11422 (2012)
27. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
28. Oh, I.S., Lee, J.S., Suen, C.Y.: Using class separation for feature analysis and combination of class-dependent features. In: Fourteenth International Conference on Pattern Recognition, no. 1, pp. 453–455 (1998)
29. Oh, I.S., Lee, J.S., Suen, C.Y.: Analysis of class separation and combination of class-dependent features for handwriting recognition. IEEE Trans. Patt. Anal. Mach. Intell. no. 21, pp. 1089–1094 (1999)
30. Tian, J., Li, M., Chen, F., Feng, N.: Learning subspace-based RBFNN using coevolutionary algorithm for complex classification tasks. IEEE Trans. Neural Netw. Learn. Sys. (2015)
31. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
32. Wang, L.P. (ed.): Support Vector Machines: Theory and Applications. Springer, New York (2005)
33. Wang, L.P., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE/ACM Trans. Bioinf. Comput. Biol. 4(1), 40–53 (2007)
34. Wang, L.P., Zhou, N., Chu, F.: A general wrapper approach to selection of class-dependent features. IEEE Trans. Neural Netw. 19(7), 1267–1278 (2008)
35. Wang, L.P., Fu, X.J.: Data Mining with Computational Intelligence. Springer, Berlin (2005)
36. Zhou, N., Wang, L.P.: Effective selection of informative SNPs and classification on the HapMap genotype data. BMC Bioinf. 8, 484 (2007)
37. Zhou, N., Wang, L.P.: Class-dependent feature selection for face recognition. In: Advances in Neuro-Information Processing, Part II, vol. 5507, pp. 551–558 (2009). Proceedings of 15th International Conference on Neural Information Processing, ICONIP 2008, Auckland, New Zealand, 2008
38. Zhou, W.G., Dickson, J.: A novel class dependent feature selection method for cancer biomarker discovery. Comput. Biol. Med. 47, 66–75 (2014)

Author information

Authors and Affiliations

Institute for Infocomm Research, 21-01 Connexis (South Tower), 1 Fusionopolis Way, 138632, Singapore
Nina Zhou
School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, 50 Nanyang Avenue, Central Area, 639798, Singapore
Lipo Wang

Corresponding author

Correspondence to Lipo Wang.

Editor information

Editors and Affiliations

Computer Science Department, University of Milano, Milano, Italy
Simone Bassis
Department of Psychology, Seconda Università di Napoli and IIASS, Caserta, Italy
Anna Esposito
Dept. of Info., Mathematics, Ele & Trans, Univ. Mediterranea of Reggio Calabria, Reggio Calabria, Italy
Francesco Carlo Morabito
Dip. Elettronica e Telecomunicazioni, Politecnico di Torino, Torino, Italy
Eros Pasero

About this paper

Cite this paper

Zhou, N., Wang, L. (2016). Processing Bio-medical Data with Class-Dependent Feature Selection. In: Bassis, S., Esposito, A., Morabito, F., Pasero, E. (eds) Advances in Neural Networks. WIRN 2015. Smart Innovation, Systems and Technologies, vol 54. Springer, Cham. https://doi.org/10.1007/978-3-319-33747-0_30

DOI: https://doi.org/10.1007/978-3-319-33747-0_30
Published: 19 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-33746-3
Online ISBN: 978-3-319-33747-0
eBook Packages: Engineering, Engineering (R0)
