1 Introduction

When dealing with biomedical data, the data dimensionality, i.e., the number of input features, can be quite large. To reduce computational burden and noise, it is often desirable to reduce the data dimensionality. An effective approach is feature extraction, for example, principal component analysis (PCA) [10, 25] and singular value decomposition [22]. The features produced by feature extraction are obtained by transforming the original features and therefore differ from them. In this paper, we focus on another approach, i.e., feature selection, which chooses a subset of input features from the entire input feature set.

Based on the measures used to find the best feature subset, feature selection methods are divided into two categories: filter approaches [19] and wrapper approaches [18]. The classical RELIEF algorithm [19] and its extended version RELIEFF [21] are examples of filter approaches. They assign a weight to each feature and then update the weight according to the training instances. These weights represent the relevance of the features; all features are therefore ranked by weight, and those with weights above a predefined threshold are selected. Wrapper approaches “wrap” feature selection around a classifier. Most wrapper approaches also use heuristic search techniques, such as sequential forward and backward search [35], hill climbing [2], and best-first search [18], to first search for candidate feature subsets, then evaluate those subsets through the classifier, and finally determine an optimal subset in terms of classification accuracy. In this paper, we adopt a wrapper approach to select features for better accuracy.

Considering the possibility that different groups of features may have different abilities in distinguishing different classes [1, 7, 8, 13, 14, 23, 24, 26, 28–30, 34, 37, 38], we choose different feature subsets for different classes. This is called class-dependent feature selection [29], as opposed to the usual class-independent feature selection [4–6, 11, 19, 36], which selects a common feature subset for all the classes in a given classification problem. We note that class-independent feature selection is in fact a special case of class-dependent feature selection: if the feature subsets chosen for all classes happen to be the same, one obtains class-independent feature selection. The filter and wrapper approaches mentioned in [19, 21] belong to class-independent feature selection. For class-dependent feature selection, Baggenstoss [1] provided related theoretical analysis and applied it to some artificial data sets. Oh et al. [28, 29] proposed a filter approach to selecting class-dependent feature subsets for the CENPARMI handwritten numerical database. Their experimental results [28, 29] showed that class-dependent feature selection achieved higher classification accuracies than class-independent feature selection. In this paper, we demonstrate a wrapper approach to selecting class-dependent features for biomedical data using the support vector machine (SVM) [31, 32] as the classifier.

This paper is organized as follows. In Sect. 2, we review the class separability measure (CSM) and introduce our approach to class-dependent feature selection. In Sect. 3, we provide experimental results of our method on two biomedical data sets from the UCI machine learning repository databases [27], and compare the results with those of class-independent feature selection. Finally, in Sect. 4, we present conclusions about the present work.

2 Methodology

2.1 The Class Separability Measure

Class separability measures (CSMs) have been used by many researchers in different versions. The class separability proposed by Oh et al. [28, 29] is represented by \(S(c_{i},c_{j},\varvec{x})\), where \(c_{i}\) and \(c_{j}\) represent class i and class j of the data set, respectively, and \(\varvec{x}\) is a training sample. Each feature’s class separability is calculated individually, e.g., \(S(c_{i},c_{j},x_{p})\) for feature p, and features are ranked according to their class separability values. Fu and Wang [11] defined another class separability measure to rank each feature’s classification capability. This CSM includes two distance elements: the within-class distance (the distance between patterns within each class) and the between-class distance (the distance between patterns among different classes), which are described in Eqs. 1 and 2. Following [11], we adopt the ratio of the within-class distance to the between-class distance to measure each feature’s classification capability. For the whole training data, the within-class distance \(S_{w}\) [11] is calculated as:

$$\begin{aligned} S_{w}=\sum _{c=1}^{C}P_{c}\sum _{j=1}^{n_{c}}(\varvec{x}_{cj}-\varvec{m}_{c})(\varvec{x}_{cj}-\varvec{m}_{c})^{T} \end{aligned}$$
(1)

and the between-class distance \(S_{b}\) [11] is calculated as:

$$\begin{aligned} S_{b}=\sum _{c=1}^{C}P_{c}(\varvec{m}_{c}-\varvec{m})(\varvec{m}_{c}-\varvec{m})^{T} \end{aligned}$$
(2)

Here C denotes the number of classes and \(P_{c}\) denotes the probability of class c. \(n_{c}\) refers to the number of samples in class c and \(\varvec{x}_{cj}\) refers to sample j in class c. \(\varvec{m}_{c}\) refers to the mean vector of class c and \(\varvec{m}\) refers to the mean vector of all the training samples. The smaller the ratio \(S_{w}/S_{b}\), the better the separability. To evaluate one feature’s classification capability, we calculate the ratio with that feature removed, denoted \(S_{w}^{'}/S_{b}^{'}\). The greater \(S_{w}^{'}/S_{b}^{'}\), the more important the removed feature is. Hence, we evaluate the importance of the features by deleting one attribute at a time and comparing the resulting ratios [11].
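
The leave-one-feature-out ranking described above can be sketched as follows. This is a minimal illustration, not the implementation used in [11]: it assumes that samples are row vectors, so that each product in Eqs. 1 and 2 reduces to a squared Euclidean distance, and that \(P_{c}\) is estimated by the class frequency. The function names `separability_ratio` and `rank_features` are hypothetical, as is the choice to return infinity when \(S_{b}\) is zero.

```python
from collections import defaultdict


def _mean(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def separability_ratio(X, y):
    """Return S_w / S_b (Eqs. 1 and 2) for samples X and labels y."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    overall_mean = _mean(X)
    s_w = 0.0
    s_b = 0.0
    for rows in by_class.values():
        p_c = len(rows) / len(X)  # assumption: P_c estimated by class frequency
        m_c = _mean(rows)
        s_w += p_c * sum(_sq_dist(r, m_c) for r in rows)
        s_b += p_c * _sq_dist(m_c, overall_mean)
    # Assumption: identical class means give no separation at all.
    return s_w / s_b if s_b > 0 else float("inf")


def rank_features(X, y):
    """Rank feature indices from most to least important.

    Feature p is scored by S_w'/S_b', the ratio computed with p removed;
    a larger score means that removing p hurt separability more.
    """
    scores = {}
    for p in range(len(X[0])):
        reduced = [[v for k, v in enumerate(row) if k != p] for row in X]
        scores[p] = separability_ratio(reduced, y)
    return sorted(scores, key=scores.get, reverse=True)
```

For a toy set in which only the first feature separates the two classes, `rank_features` places that feature first, because deleting it leaves the class means indistinguishable.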

2.2 Our Approach to Class-Dependent Feature Selection

We describe our class-dependent feature selection method in three steps. In step one, we convert a C-class classification problem into C two-class classification problems. Each problem has only two classes: the current class and a second class that merges all the other classes. In step two, for each two-class problem, we adopt the ranking measure CSM to evaluate the importance of each feature. In step three, based on each class-dependent feature importance ranking list, we form different feature subsets for each class by sequentially adding one feature to the previous subset. Each feature subset is evaluated through an SVM, and the feature subset corresponding to the highest classification accuracy is our choice for this class.
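
Steps one and three can be sketched as below. This is an illustrative outline under stated assumptions rather than the exact procedure used in our experiments: the names `one_vs_rest_labels`, `nested_subsets`, and `select_subset` are hypothetical, the first candidate subset is assumed to contain only the top-ranked feature, and `evaluate` stands in for any routine that trains an SVM on a given subset and returns its classification accuracy.

```python
def one_vs_rest_labels(y, target):
    """Step one: relabel a C-class problem as 'target' (+1) versus the rest (-1)."""
    return [1 if label == target else -1 for label in y]


def nested_subsets(ranking):
    """Step three: grow subsets by adding one feature at a time in ranked order."""
    return [ranking[:k] for k in range(1, len(ranking) + 1)]


def select_subset(ranking, evaluate):
    """Return the candidate subset with the highest score and that score.

    `evaluate` is a caller-supplied function (an assumption of this sketch)
    mapping a list of feature indices to a classification accuracy.
    """
    best_subset, best_score = None, float("-inf")
    for subset in nested_subsets(ranking):
        score = evaluate(subset)
        if score > best_score:  # on ties, the smaller subset is kept
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Keeping the smaller subset on ties is a design choice of this sketch; it favours deleting more features when the accuracy is unchanged.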

During the process, a feature mask is introduced to describe each feature’s state, i.e., kept or removed. The feature mask is a vector in which each element takes one of two values, ‘0’ or ‘1’: ‘0’ represents the absence of a particular feature and ‘1’ represents its presence. For example, for a data set with 5 features \(\{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\}\), if the optimal feature subset obtained is \(\{x_{1}, x_{3}, x_{5}\}\), the corresponding feature mask is the vector {1, 0, 1, 0, 1}.
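
The mask example above can be written out in a few lines. The helper names `subset_to_mask` and `apply_mask` are hypothetical, and features are indexed from 0 here, so \(x_{1}, x_{3}, x_{5}\) correspond to indices 0, 2, and 4.

```python
def subset_to_mask(selected, n_features):
    """Build a 0/1 feature mask from a collection of selected feature indices."""
    return [1 if k in selected else 0 for k in range(n_features)]


def apply_mask(sample, mask):
    """Keep only the feature values whose mask entry is 1."""
    return [value for value, keep in zip(sample, mask) if keep == 1]


mask = subset_to_mask({0, 2, 4}, 5)             # features x1, x3, x5
filtered = apply_mask([10, 20, 30, 40, 50], mask)
```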

2.3 SVM with Class-Dependent Features

We first build several SVM models and then combine them to accommodate class-dependent feature subsets. Each model is a binary classifier and is specific to one class. In the following, we introduce the construction process.

  1.

    The training process: In this process, we construct C SVM models from the training patterns, i.e., each class has its own model according to its specific feature subset. For example, model i is trained with all the training examples in class i having positive labels and all the examples in the other classes having negative labels. Specifically, all the training examples are filtered by the feature mask of class i before they are input for training. For instance, if the feature mask of class i contains \(n^{(i)}\) ones, all the training examples used to build model i have \(n^{(i)}\) features as the input, and the features corresponding to ‘0’ in the feature mask are removed. The output can be either ‘+1’ or ‘−1’. If the input pattern \(\varvec{x}_{j}\) belongs to class i, we consider it a positive sample (‘+1’); otherwise, we consider it a negative sample (‘−1’). The ith SVM model solves the following problem [16]:

    $$\begin{aligned} \begin{array}{l} \displaystyle \text {min}_{\omega ^{i},b^{i},\xi ^{i}} \frac{1}{2} \varvec{\omega ^{i}}^{T}\varvec{\omega ^{i}}+\varsigma ^{i}\sum _{j=1}^{l}\xi _{j}^{i}\\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \ge 1- \xi ^{i}_{j}, \text { if } y_{j}= i, \\ \displaystyle \varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i} \le -1 + \xi ^{i}_{j}, \text { if } y_{j}\ne i,\\ \displaystyle \xi ^{i}_{j}\ge 0,\quad j=1,\ldots ,l \end{array} \end{aligned}$$
    (3)

    where \(\phi \) is the mapping function, \(\varsigma ^{i}\) is the penalty parameter for class i, and \(\xi _{j}^{i}\) are the “slack variables” for class i. \(\varvec{x}_{j}\) is sample j of the l training samples. Minimizing \(\frac{1}{2}(\omega ^{i})^{T}\omega ^{i}\) means maximizing the margin between the two groups of data. \(\varsigma ^{i}\sum ^{l}_{j=1}\xi ^{i}_{j}\) is a penalty term used to reduce the number of training errors when the data are not linearly separable.

  2.

    The testing process: After the class-dependent models are constructed, we use them to classify unlabeled patterns. As in the training process, each testing pattern is filtered with a class’s feature mask before being input into the corresponding SVM model, i.e., the original attributes corresponding to ‘0’ in the feature mask are removed. Among the C outputs, the testing pattern \(\varvec{x}_{j}\) is assigned to the class with the largest output value:

    $$\begin{aligned} \text {Class of } \varvec{x}_{j}\equiv \text {argmax}_{i=1,2,\ldots ,C}(\varvec{\omega ^{i}}^{T}\phi (\varvec{x}_{j})+ b^{i}) \end{aligned}$$
    (4)
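
The decision rule of Eq. 4 can be sketched as follows, assuming the C models have already been trained. Everything named here is hypothetical: `predict_class` takes one feature mask and one decision function per class, and `linear_decision` builds a toy decision function with the identity mapping in place of \(\phi \), which is a simplification of this sketch rather than the RBF-kernel models used in Sect. 3.

```python
def apply_mask(sample, mask):
    """Keep only the feature values whose mask entry is 1."""
    return [value for value, keep in zip(sample, mask) if keep == 1]


def linear_decision(weights, bias):
    """Toy decision function w^T x + b (assumption: phi is the identity)."""
    def decide(x):
        return sum(w * v for w, v in zip(weights, x)) + bias
    return decide


def predict_class(sample, masks, decision_functions):
    """Eq. 4: filter the sample per class, then pick the largest output.

    masks[c] and decision_functions[c] belong to class c.
    """
    best_class, best_value = None, float("-inf")
    for cls, mask in masks.items():
        value = decision_functions[cls](apply_mask(sample, mask))
        if value > best_value:
            best_class, best_value = cls, value
    return best_class


# Two classes that look at different features of a 3-feature pattern.
masks = {"a": [1, 0, 0], "b": [0, 1, 1]}
models = {"a": linear_decision([1.0], -1.0),
          "b": linear_decision([1.0, 1.0], -1.0)}
```

Because each model sees a differently filtered copy of the same pattern, the input dimension may differ from class to class; only the scalar outputs are compared.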

3 Experiments and Discussions

To examine whether class-dependent feature selection is more effective than class-independent feature selection, we conduct experiments on two biomedical data sets from the UCI machine learning repository databases [27]. The two criteria used to compare the methods are the number of features deleted and the classification accuracy.

3.1 Experimental Data

The first data set is the Ecoli data set. It has 7 attributes and 8 classes (localization sites of the protein). The number of instances is 336. The second data set is the processed Cleveland data set. It mainly concerns heart disease diagnosis and was collected from the Cleveland Clinic Foundation. There are originally 303 samples, 13 features and 5 classes. Because 6 samples have unknown feature values, we remove them from our experiment, leaving 297 samples.

3.2 Experiment and Results

Among the various SVM software packages, LIBSVM 3.1 [3] with the RBF kernel was chosen for our experiment. The 10-fold cross-validation method is used to calculate the accuracy. The results for the Ecoli data set [15] (Table 1) show that the number of features deleted differs considerably across classes under class-dependent feature selection. The results on the Cleveland Heart Disease data [9] (Table 2) also show that different classes have very different feature subsets. Class 1 has few features removed, i.e., on average 1.9 (Table 3), whereas for classes 2, 3, 4 and 5, the number of features deleted in the 10 simulations is on average between 9 and 12.
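
For readers unfamiliar with the protocol, 10-fold cross-validation can be sketched as below. This is a generic outline and not the LIBSVM routine itself: the fold construction (shuffled, unstratified), the pooling of correct predictions over all folds, and the names `k_fold_indices`, `cross_val_accuracy`, and `fit_predict` are assumptions of this sketch. `fit_predict` stands for any caller-supplied routine that trains on the given training indices and returns predicted labels for the held-out indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(y, fit_predict, k=10, seed=0):
    """Fraction of samples predicted correctly when each fold is held out once."""
    n_samples = len(y)
    correct = 0
    for held_out in k_fold_indices(n_samples, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(n_samples) if i not in held]
        predictions = fit_predict(train_idx, held_out)
        correct += sum(1 for i, p in zip(held_out, predictions) if p == y[i])
    return correct / n_samples
```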

Table 1 Feature selection results for the Ecoli data set
Table 2 Feature selection results for the Cleveland heart disease data set
Table 3 Classification accuracy comparisons among different feature selection methods for the two biomedical data sets

In Table 3, we present classification accuracies for 3 different conditions, i.e., without feature selection, with class-dependent feature selection, and with class-independent feature selection. The most obvious improvement in classification accuracy is on the Cleveland data set. Compared with the accuracy on the data without feature selection, our method increases the accuracy by about 3 %. Compared with the class-independent method, our method increases the accuracy from 56.23 to 58.61 %.

4 Conclusions

In this paper, we demonstrated an approach to class-dependent feature selection. We adopted the class separability measure [11] to evaluate feature importance, based on which an optimal feature subset was determined for each class through the SVM. The experimental results for two biomedical data sets [27] show that each class has a different feature subset, which includes representative features for distinguishing the current class from the other classes, and that the corresponding classifier can improve or at least maintain the classification accuracy using those class-dependent feature subsets.