1 Introduction

A feature is an individual measurable property of a phenomenon being observed. The representation of raw input data typically uses many features, only some of which are relevant to the class. Feature selection for supervised classification can be carried out on the basis of entropy-based information measures between features and classes. In this work, we use Shannon entropy [10] and Renyi's and Tsallis entropies [6] to compute the mutual information [3] between a feature and the class, or between pairs of features. mRMR [8] is known to be a practical and effective algorithm for feature selection and classification; however, it does not perform well when a dataset contains only a small number of attributes [8]. The main motivation behind our work is to develop an enhanced feature selection algorithm that performs consistently well on all kinds of datasets. We develop an ensemble method for entropy-based feature selection and evaluate it with common machine learning algorithms on a variety of UCI gene expression datasets, along with a comparative study of existing entropy-based feature selection methods. Our method eliminates irrelevant and redundant features and, in the majority of cases, improves the performance of the learning algorithms.

2 Related Work

In the past two decades, a good number of MI-based feature selection algorithms have been introduced. Two important aspects of feature selection are: (i) minimum redundancy among the selected features and (ii) maximum relevance of a feature to a given class label. Some well-known MI-based feature selection algorithms are Information Gain [1], Gain Ratio [3], mRMR [8] and its variant [9]. InfoGain and GainRatio select features based on relevance only, whereas the other MI-based algorithms mentioned select the most relevant and least redundant features. From our study, we observe that mRMR is appropriate for a large number of applications involving large numbers of features [8], and it performs well on both continuous and discrete data.

To achieve minimum redundancy and maximum relevance for categorical variables [5], most researchers adopt the view that if a feature's values are uniformly distributed across the different classes, its mutual information with those classes is zero, whereas a feature that is highly differentially expressed across classes has large mutual information with them. Thus, mutual information can be used as a measure to estimate the relevance of features.
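As a concrete illustration of this relevance argument (not taken from the paper), the following Python sketch estimates the mutual information between a discrete feature and the class labels; the tiny two-class example and the helper functions are assumptions made purely for demonstration.

```python
# Minimal sketch: feature-class mutual information as a relevance score,
# assuming discrete (already discretized) feature values.
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a discrete sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# A feature spread uniformly over both classes carries ~0 bits about the class,
# while a differentially expressed feature carries close to 1 bit.
classes  = [0, 0, 0, 0, 1, 1, 1, 1]
uniform  = [1, 2, 1, 2, 1, 2, 1, 2]   # same distribution in both classes
relevant = [1, 1, 1, 1, 2, 2, 2, 2]   # differentially expressed
print(mutual_information(uniform, classes))   # approximately 0.0
print(mutual_information(relevant, classes))  # approximately 1.0
```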

The mRMR algorithm aims to select a feature set S that shows maximum relevance to a given class (the features provide maximum information about the class) while being minimally redundant. mRMR considers the mutual information of each feature with the classes, but also subtracts the redundancy of each feature with the already selected ones. It follows a filter criterion based on mutual information estimation: instead of estimating the mutual information between the whole set of features and the class labels, the authors estimate it for each candidate feature separately. On one hand, they maximize the relevance \(I(x_j; C)\) of each candidate feature, and on the other hand they minimize its redundancy with the already selected features. The criterion for selecting the \(m^{th}\) feature can be expressed as:

$$\begin{aligned} \max _{x_j \in X - S_{m-1}}\left[ I(x_j;C)- \frac{1}{m-1} \sum _{x_i \in S_{m-1}} I(x_j;x_i)\right] . \end{aligned}$$
(1)

This criterion can be used by a greedy algorithm which, in each iteration, takes a single feature and decides whether to add it to the selected feature set or to discard it; the process is repeated until the required set S of K optimal features is obtained. This implies that the \(m^{th}\) feature \(x_m\) is selected only when a set \(S_{m-1}\) of \((m-1)\) features already exists. We refer to the original mRMR method as \(mRMR_{MI}\). Another variant of the mRMR criterion [9] also exists (referred to here as \(mRMR_{GR}\)). In [9], the criterion is reformulated using a different representation of redundancy: the authors propose a coefficient of uncertainty obtained by dividing the MI value between two features \(x_j\) and \(x_i\) by the entropy \(H(x_i)\), where \(x_i \in S_{m-1}\). The resulting criterion is given below.

$$\begin{aligned} \max _{x_j \in X - S_{m-1}}\left[ I(x_j;C)- \frac{1}{m-1} \sum _{x_i \in S_{m-1}}\frac{I(x_j;x_i)}{H(x_i)}\right] \end{aligned}$$
(2)
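To make the two criteria concrete, the following Python sketch implements a plain greedy forward selection for Eq. 1 (\(mRMR_{MI}\)) and Eq. 2 (\(mRMR_{GR}\)). It is an illustrative reimplementation, not the original authors' code; it assumes the data matrix X is already discretized and uses scikit-learn's mutual_info_score as the MI estimator.

```python
# Greedy forward selection for the mRMR_MI (Eq. 1) and mRMR_GR (Eq. 2) criteria.
# X: (n_samples, n_features) array of discretized values; y: class labels.
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def greedy_mrmr(X, y, k, variant="MI"):
    n_features = X.shape[1]
    relevance = [mutual_info_score(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]        # start from the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            if variant == "MI":                   # Eq. 1: average MI redundancy
                red = np.mean([mutual_info_score(X[:, j], X[:, i]) for i in selected])
            else:                                 # Eq. 2: MI normalized by H(x_i)
                red = np.mean([mutual_info_score(X[:, j], X[:, i]) / entropy(X[:, i])
                               for i in selected])
            score = relevance[j] - red
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```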

In this study, we use these two variants of the mRMR algorithm in our experiments, analyze their pros and cons, and introduce another variant of mRMR that is an effective combination of the two.

3 \(mRMR+:\) The Proposed Ensemble mRMR Algorithm

We carried out an exhaustive experimental study using the two variants of mRMR on benchmark datasets. Our observation is that the MI-based mRMR (i.e., \(mRMR_{MI}\)) does not perform well when the number of attributes in the dataset is small [8]. The second variant of mRMR can eliminate this disadvantage of \(mRMR_{MI}\); however, we found from our exhaustive experimentation that \(mRMR_{GR}\) in turn often performs poorly when the number of attributes is large.

Our proposal, \(mRMR+\), is an effective combination of the above two mRMR variants through a weight function. We performed exhaustive experimentation to determine an appropriate weight function dynamically, and found that in most cases \(mRMR_{MI}\) does not perform well when the MI value between two variables is larger than the corresponding GR (gain ratio) value. To eliminate this problem, we combine the two variants of mRMR (Eqs. 1 and 2) in such a way that the combined criterion performs consistently well for any number of variables. Our method performs better than the above variants of mRMR on almost all datasets. We also perform a comparative analysis among all three variants of mRMR using the aforesaid three entropy measures. The proposed formulation of \(mRMR+\) for the selection of the \(m^{th}\) feature is as follows:

$$\begin{aligned} \max _{x_j \in X - S_{m-1}}\left[ I(x_j;C)-\left( \frac{l}{m-1}\sum _{x_i\in S_{m-1}} I(x_j;x_i)+\frac{1-l}{m-1} \sum _{x_i \in S_{m-1}}\frac{I(x_j;x_i)}{H(x_i)}\right) \right] . \end{aligned}$$
(3)
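A minimal sketch of the Eq. 3 score for a single candidate feature is given below. It assumes the discretized data and MI estimator of the earlier sketch, and takes the weight l (whose computation is described in the following paragraphs) as a parameter.

```python
# Eq. 3 score of candidate feature x_j given the already selected set.
# Assumes discretized columns; l blends the Eq. 1 and Eq. 2 redundancy terms.
import numpy as np
from sklearn.metrics import mutual_info_score

def _entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mrmr_plus_score(X, y, j, selected, l):
    relevance = mutual_info_score(X[:, j], y)                        # I(x_j; C)
    red_mi = np.mean([mutual_info_score(X[:, j], X[:, i])            # Eq. 1 term
                      for i in selected])
    red_gr = np.mean([mutual_info_score(X[:, j], X[:, i]) / _entropy(X[:, i])
                      for i in selected])                            # Eq. 2 term
    return relevance - (l * red_mi + (1.0 - l) * red_gr)
```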

Our method takes a gene expression dataset as input and applies discretization in a preprocessing step to eliminate noise from the data. The value of the weight function l is computed before finding the top relevant feature based on the MI value between feature and class. After that, using Eq. 3, we find the least redundant and most relevant feature among the remaining features and add one feature at a time to the selected feature list until the required K optimal features are selected.
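The discretization scheme itself is not restated here (we use the same technique as the cited mRMR variants); the sketch below shows one common three-level scheme from the mRMR literature, based on mean ± standard deviation thresholds, purely as an assumed placeholder for this preprocessing step.

```python
# Illustrative preprocessing only: three-level discretization of expression data.
# The mean +/- alpha*std thresholds are an assumption, not the paper's exact recipe.
import numpy as np

def discretize_three_level(X, alpha=1.0):
    """Map each column of X to {-1, 0, 1} using per-feature mean +/- alpha*std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    out = np.zeros(X.shape, dtype=int)
    out[X > mu + alpha * sigma] = 1
    out[X < mu - alpha * sigma] = -1
    return out
```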

Our proposed weight function takes the gene expression data as input and calculates \(m=\max _i MI(x_{i},C)\) and \(n=\max _i \frac{MI(x_i,C)}{H(C)}\) (a gain-ratio-style normalization of MI by the class entropy). If \(m\ge n\), the weight is calculated as \(l=1-\frac{n}{m}\); otherwise it is calculated as \(l=\frac{m}{n}\). To select the \(m^{th}\) feature, the computational complexity of this incremental search is \(O(|S| \cdot M)\), where M is the number of attributes in the dataset, which is the same as that of the MI-based mRMR algorithm [8].
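The following sketch mirrors the weight computation described above, assuming discretized inputs; the class entropy \(H(C)\) is obtained as \(I(C;C)\), and the variable names are illustrative rather than taken from the paper.

```python
# Weight l for mRMR+: compares the best raw MI (m) with the best H(C)-normalized MI (n).
import numpy as np
from sklearn.metrics import mutual_info_score

def compute_weight(X, y):
    h_c = mutual_info_score(y, y)                       # I(C;C) = H(C)
    mi = np.array([mutual_info_score(X[:, i], y) for i in range(X.shape[1])])
    m = mi.max()                                        # m = max_i I(x_i; C)
    n = (mi / h_c).max()                                # n = max_i I(x_i; C) / H(C)
    return 1.0 - n / m if m >= n else m / n
```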

Table 1. Datasets and 10-fold cross validation accuracy of the classifiers using all features

4 Experimental Results

To evaluate the usefulness of the different variants of the mRMR algorithm and the different entropy measures, five UCI machine learning gene expression datasets with \(\ge 2\) classes were chosen; they are presented in Table 1. Table 1 also reports the 10-fold cross validation accuracy obtained with all features using three classification methods, viz., Naive Bayes (NB) [7], Random Forest (RF) [4] and IBK [2]. Generally, RF performed better than the other classification methods due to its suitability for high dimensional data. To discretize the datasets, we use the same discretization technique as the two mentioned variants of mRMR. Due to space constraints, we are unable to present detailed results.

Figure 1(a) presents the classification accuracy of the NB classifier on the lung cancer dataset when features are selected in the forward direction. The average classification accuracies of Shannon, Renyi's and Tsallis entropy based MIs are 85.31%, 77.50% and 82.50%, respectively, whereas the average classification accuracy of the Shannon entropy based mRMR variants is 88.12% and our proposed method achieves 88.44% on average. Figure 1(b) reports results on the colon tumor dataset with the NB classifier when the top ranked features are selected in the forward direction. The average classification accuracies of Shannon, Renyi's and Tsallis entropy based MIs are 86.94%, 86.61% and 86.61%, respectively, and Shannon entropy based mRMR dominates the other entropy based mRMR results. The average classification accuracy of the Shannon entropy based mRMR variants is 88.87%, whereas our proposed method achieves 89.52% on average.

Table 2 reports the average classification accuracy of the NB classifier for the top ten selected features under \(mRMR_{MI}\) with the different entropy measures. We found that the Shannon entropy based mRMR variants always dominate the other entropy based mRMR variants; therefore, the remaining experimental results consider only Shannon entropy based mRMR variants. Figure 1(c) reports the classification accuracy of the NB classifier on the breast cancer dataset; the average classification accuracies of Shannon, Renyi's and Tsallis entropy based MIs for this dataset are the same (95.29%). For the Shannon entropy based mRMR variants, the average classification accuracy is 95.81%, and our method achieves 95.85% on average. Figure 1(d) presents the classification accuracy of the NB classifier on the promoter dataset; the average classification accuracies of Shannon, Renyi's and Tsallis entropy based MIs for this dataset are 90.37%, 90.467% and 89.90%, respectively. For the Shannon entropy based mRMR variants, the average classification accuracy is 90.97%, and our method achieves 91.03% on average. Figure 1(e), (f) and (g) report results on the NCI dataset using the NB, IBK and RF classifiers, respectively. The average classification accuracies of Shannon, Renyi's and Tsallis entropy based MIs for the NCI dataset are 53.33%, 56.50% and 55.00%, respectively, and \(mRMR+\) shows higher average classification accuracies for the NB, IBK and RF classifiers.
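For clarity about the evaluation protocol only, the sketch below runs 10-fold cross validation with scikit-learn stand-ins for the three classifiers (GaussianNB for NB, RandomForestClassifier for RF, and KNeighborsClassifier as an approximation of WEKA's IBk); the selected-feature matrix and hyperparameters are placeholders, not the exact configuration used in our experiments.

```python
# 10-fold cross validation of NB, RF and a k-NN stand-in for IBk on the
# columns returned by the feature selection step (X_selected).
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X_selected, y):
    models = {
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "IBK": KNeighborsClassifier(n_neighbors=1),
    }
    return {name: cross_val_score(clf, X_selected, y, cv=10).mean()
            for name, clf in models.items()}
```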

Fig. 1. (a) Accuracy of NB classifier on Lung Cancer. (b) Accuracy of NB classifier on Colon Tumor. (c) Accuracy of NB classifier on Breast Cancer. (d) Accuracy of NB classifier on Promoter. (e) Accuracy of NB classifier on NCI. (f) Accuracy of IBK classifier on NCI. (g) Accuracy of RF classifier on NCI.

Table 2. Average classification accuracy of NB classifier in % for top 10 features

Finally, we present the effectiveness of our method in Table 2. Table 2 shows that Shannon entropy based \(mRMR_{MI}\) outperforms Renyi's entropy based \(mRMR_{MI}\) on four out of five datasets and Tsallis entropy based \(mRMR_{MI}\) on all five datasets. Moreover, our method \(mRMR+\), based on Shannon entropy, consistently performed well on all datasets in every aspect of our analysis.

5 Conclusions

Our method, referred to as \(mRMR+\), performs significantly better than the competing mRMR algorithm and its variant over five benchmark datasets. Our study also includes an exhaustive empirical comparison of three well known entropy measures, used while selecting relevant and non-redundant features to achieve the best possible classification accuracy.