Abstract
In machine learning, feature selection is an important dimension reduction technique that aims to choose the features with the best discriminant ability and thereby avoid the curse of dimensionality in subsequent processing. As a supervised feature selection method, Fisher score (FS) provides a feature evaluation criterion and has been widely used. However, FS ignores the association between features because it assesses all features independently, and it loses local information because it fully connects all within-class samples. To solve these issues, this paper proposes a novel feature evaluation criterion based on FS, named iteratively local Fisher score (ILFS). Compared with FS, the new criterion pays more attention to the local structure of data by using K nearest neighbours instead of all samples when calculating the within-class and between-class scatters. To account for the relationship between features, we calculate local Fisher scores of feature subsets instead of scores of single features, and iteratively select the current optimal feature in the manner of sequential forward selection (SFS). Experimental results on UCI and TEP data sets show that the improved algorithm performs well in classification tasks compared with some other state-of-the-art methods.
1 Introduction
High-dimensional data can make traditional algorithms perform poorly because of the curse of dimensionality [18]. In addition to increasing the computational burden of algorithms, too many features may cause overfitting. To reduce dimensionality, feature selection has been widely used as an efficient technique; it aims to select useful features that form an optimal feature subset with good discriminant ability for subsequent processing [6, 12]. The key question for any feature selection method is how to measure the importance of features. Thus, feature selection methods require a feature evaluation criterion, which determines their performance.
As a filter method, Fisher score (FS) is independent of learning algorithms, and only uses a mathematical principle to give each feature a score. According to these scores, the best features can be selected to form the target feature subset. FS has been widely used. For face recognition, Chen et al. [7] combined FS with some traditional features such as log-likelihood and appearance-based features to form a new feature vector that takes full advantage of both local and holistic features of the human face. Dixit et al. [8] encoded the bag-of-semantics in the natural parameter space of the multinomial using a vector of Fisher scores to classify scenes for transfer tasks. In addition, Gu et al. [11] proposed a generalized Fisher score (GFS) that finds an optimal feature subset by maximizing the lower bound of the traditional Fisher score. The optimization problem of GFS can be solved by using multivariate ridge regression and projected gradient descent. However, FS does not consider possible relationships between features, such as combination or redundancy; it considers only the influence of a single feature on the within-class and between-class scatters [3, 33]. Moreover, FS may misrepresent the real scatters because it computes them from a global view.
To solve the above issues in FS, this paper presents a novel feature selection method, called iteratively local Fisher score (ILFS). It is generally believed that the nearest samples are likely to have the same label, which means they have the greatest impact on classification. When evaluating a feature, it is beneficial to consider the local structure of the data set in some cases. The goal of ILFS is to minimize the local within-class scatter and maximize the local between-class scatter. We use K nearest neighbors to represent the distribution and use only the distances within each group of K nearest neighbor samples when calculating the scatters. Meanwhile, ILFS measures the importance of a feature subset rather than of a single feature as FS does. For the convenience of calculation, we use the idea of sequential forward selection (SFS). The feature selected in each iteration is the one that makes the current feature subset achieve the greatest local Fisher score. In doing so, ILFS takes into account the relationship between features. Comparative experiments show that ILFS has excellent performance.
The rest of this paper is organized as follows. In Section 2, we give a brief overview of feature selection and then introduce the traditional FS and SFS in detail. Section 3 presents the novel feature evaluation criterion ILFS. Section 4 gives experiments on the UCI and TEP data sets. Finally, conclusions are drawn in Section 5.
2 Related works
In this section, we first simply introduce feature selection. Then, the related FS method is reviewed. Finally, we describe the scheme of sequential forward selection.
2.1 Feature selection
Feature selection is the process of removing redundant and unimportant features from the original feature set to obtain an optimal subset according to certain rules. As a dimension reduction method, it has attracted increasing attention in fields such as machine learning and pattern recognition. Its advantages include deepening the understanding of features, reducing the amount of computation and improving the performance of classifiers.
On the basis of whether the class label information exists or not, methods for feature selection can be divided into three categories: supervised [16], unsupervised [29] and semi-supervised [28]. According to the relationship with learners, methods for feature selection can be also divided into three categories: Filter, Wrapper and Embedded [30].
Filter methods are independent of learning algorithms. After all features are evaluated, the important features are combined into feature subsets for subsequent classification tasks. For example, Fisher-based [4] methods try to find features with the smallest within-class distance and the largest between-class distance at the same time. Relief-based methods [26] maximize the difference between the nearest neighbors with the same label and those with different labels for a given sample. In addition, filter methods also include max-relevance and min-redundancy (mRMR) [24], the Chi-square test (Chi2) [20] and others. Filter feature selection methods are simple and fast, and can be easily extended to high-dimensional data. The evaluation results can be directly applied to various classification models. However, these methods may be weak in the subsequent processing owing to the lack of interaction with learning models.
Wrapper methods are related to learning algorithms. In the process of wrapper feature selection, a classifier needs to be trained so that a feature subset can be measured according to the classification performance. The classifiers that can be used include decision tree (DT), K nearest neighbor classifier (KNN), Bayesian classifier (Bayes), support vector machine (SVM) and so on. Bugata et al. [5] proposed a weighted K nearest neighbours (WKNN) method for feature selection, which weights each feature and modifies the weights until a satisfactory result is obtained. The final feature subset obtained by wrapper methods is more suitable for the adopted learning model. However, when the learning method changes, feature selection must be performed again. Moreover, wrapper feature selection methods consume more time and overfit more easily than filter methods.
For embedded methods, the process of feature selection is embedded in the model training of a learning method and realized by optimizing an objective function. Compared with wrapper methods, embedded feature selection methods avoid the repeated training of the learning method when evaluating each feature subset, and can quickly get the best one. However, methods of this kind depend heavily on parameter tuning, so their performance is very sensitive to the parameters.
Some researchers improve the classical algorithms to achieve better results. For example, Zhou et al. [35] extended relevance and redundancy to conditional relevance and conditional redundancy and built a new frame integrating them to propose two methods: minimum conditional relevance-minimum conditional redundancy (MCRMCR) and minimum conditional relevance-minimum intra-class redundancy (MCRMICR). Zhang et al. [34] proposed the symmetrical complementary coefficient (SCC) to quantify features and interactions. On the basis of SCC, the ReliefF algorithm can be improved by screening out representative samples from the original data set. Researchers also tried to combine feature selection with other methods, such as rough set [21, 22], krill herd algorithm [1, 2], and Q-learning [23].
2.2 Fisher score
FS is one of the classical supervised feature selection methods, which seeks the optimal features that make the within-class scatter minimum and the between-class scatter maximum. Assume that there is a set of labeled samples \(\left \{(\mathbf {x}_{i}, y_{i}) \right \}_{i=1}^{n}\), where \(\mathbf {x}_{i}=[x_{i1}, x_{i2}, \cdots , x_{im}]^{T}\in \mathbb {R}^{m}\), \(y_{i}\in \{1,2,\cdots ,c\}\) is the label of \(\mathbf {x}_{i}\) indicating its category, m is the dimension of samples, and c is the number of classes. FS first assigns a score to each feature of the original data and then selects an optimal feature subset according to the scores. The score of the k-th feature can be computed independently as
\[F(k)=\frac{{\sum }_{i=1}^{c} n_{i}({\mu _{k}^{i}}-\mu _{k})^{2}}{{\sum }_{i=1}^{c} n_{i}({\sigma _{k}^{i}})^{2}} \qquad (1)\]
where \(n_{i}\) is the number of samples in the i-th class, \({\mu _{k}^{i}}\) and \({\sigma _{k}^{i}}\) are the mean and standard deviation of the k-th feature in the i-th class, respectively, and \(\mu _{k}\) is the mean of the k-th feature over all classes.
In (1), \({\sum }_{i=1}^{c} n_{i}({\mu _{k}^{i}}-\mu _{k})^{2}\) and \({\sum }_{i=1}^{c} n_{i}({\sigma _{k}^{i}})^{2}\) are the between-class scatter and the within-class scatter of the k-th feature, respectively. Thus, the score F(k) can represent the discriminant ability of the k-th feature. The greater the score F(k) is, the more discriminant ability the k-th feature has.
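As an illustration of (1), the following minimal Python sketch computes F(k) for every feature at once. It assumes NumPy, a sample matrix `X` of shape (n, m) and a label vector `y`; the function name `fisher_scores` is illustrative and is not taken from the released source code. The sketch also assumes that every feature has a nonzero within-class scatter, since no guard against a zero denominator is included.

```python
import numpy as np

def fisher_scores(X, y):
    """Return the Fisher score F(k) of every feature (column) of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)              # mu_k for every feature k
    between = np.zeros(X.shape[1])             # between-class scatter per feature
    within = np.zeros(X.shape[1])              # within-class scatter per feature
    for label in np.unique(y):
        Xc = X[y == label]                     # samples of the current class
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)         # (sigma_k^i)^2, population variance
    # Assumption: within > 0 for every feature.
    return between / within
```

Features can then be ranked by sorting the returned scores in descending order, for example with `np.argsort(-fisher_scores(X, y))`.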
Since FS assesses all features independently, it ignores the association between features. Assume that there are two features with high scores, that is, each of them has good discriminant ability on its own. It is nevertheless uncertain whether they would improve the discriminant ability when used together. Similarly, two features with low scores may achieve a good classification performance when they are combined. In addition, FS computes both the between-class scatter and the within-class scatter from a global view. The resulting scatters may not reflect the real ones if the data have a manifold structure.
2.3 Sequential forward selection
SFS is a heuristic search method that essentially adopts a greedy strategy [13, 19, 31]. The target feature subset starts from an empty set. In each iteration, the candidate feature that makes the evaluation function optimal is selected and added to the target feature subset. The algorithm stops when the termination condition is reached, for example when all features are in the target feature subset.
As a simple greedy algorithm, SFS searches quickly, but it can only add features during the search process, so it easily falls into a local optimum.
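The SFS scheme can be sketched in a few lines of Python. This is a generic illustration under our own naming: `evaluate` is a hypothetical user-supplied function that maps a list of feature indices to a score (higher is better), and the termination condition is assumed to be a fixed number of selected features.

```python
def sequential_forward_selection(n_features, evaluate, n_select):
    """Greedy forward search: add one feature per iteration.

    evaluate(subset) is a hypothetical callable returning the score of a
    list of feature indices; larger scores are assumed to be better.
    """
    selected = []                          # target subset, starts empty
    remaining = list(range(n_features))    # candidate features
    while remaining and len(selected) < n_select:
        # Pick the candidate whose addition gives the best score.
        best = max(remaining, key=lambda k: evaluate(selected + [k]))
        selected.append(best)
        remaining.remove(best)             # features are never removed from `selected`
    return selected
```

Because a feature is never dropped once it has been added, the order of `selected` also records the order in which the greedy search made its choices.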
3 Iteratively local Fisher score
In this section, we propose a novel feature selection method, ILFS, which can address the shortcomings of FS mentioned above.
3.1 Local Fisher score
It is unwise to construct a neighbourhood relationship from a global view if data have a structure of latent manifold. Here, we consider a local neighbourhood relationship that is constructed by using K nearest neighbours of a sample instead of all samples [14, 15]. The local neighbourhood relationship can be represented by two scatter matrixes S and D, where S is the local within-class scatter matrix, and D the local between-class scatter matrix. Two scatter matrixes are defined as:
and
where KNN(xi) is the set of K nearest neighbours of xi.
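To consider the association between features, we define the local Fisher score of a feature subset. Without loss of generality, let A be an index set of features. Then, the local Fisher score of A is defined by
\[J(A)=\frac{{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left \|\mathbf {x}^{A}_{i}-\mathbf {x}^{A}_{j}\right \|^{2} D_{ij}}{{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left \|\mathbf {x}^{A}_{i}-\mathbf {x}^{A}_{j}\right \|^{2} S_{ij}+\delta } \qquad (4)\]
where \(\mathbf {x}_{i}^{A} \in \mathbb {R}^{|A|}\) is a sub-vector of \(\mathbf {x}_{i}\) with features in A and δ is a small positive constant that is introduced to avoid a zero denominator.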
In (4), the numerator and the denominator denote the local scatters of between-class and within-class in a subspace A, respectively. Compared to (1), the local Fisher score focuses on the discriminant ability of a feature subset instead of a single feature. The higher the local Fisher score J(A) is, the more discriminant ability the feature subset A has. Moreover, J(A) contains the local information of original data. Thus, the selected feature subset can maintain the local structure of data well.
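A possible NumPy sketch of the local Fisher score of a feature subset is given below. The weighting used for S and D here is an assumption made for illustration: two samples are connected with weight 1 if either is among the K nearest neighbours of the other, with same-label pairs going to S and different-label pairs going to D. The function names and this binary, symmetrised weighting are ours and may differ from the definitions used by the method.

```python
import numpy as np

def knn_scatter_matrices(X, y, n_neighbors):
    """Assumed binary KNN weights: S for same-label, D for different-label pairs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)          # a sample is not its own neighbour
    nn_idx = np.argsort(sq_dist, axis=1)[:, :n_neighbors]
    is_nn = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), nn_idx.shape[1])
    is_nn[rows, nn_idx.ravel()] = True
    is_nn |= is_nn.T                           # assumption: symmetric neighbourhood
    same = y[:, None] == y[None, :]
    S = (is_nn & same).astype(float)           # local within-class weights
    D = (is_nn & ~same).astype(float)          # local between-class weights
    return S, D

def local_fisher_score(X, S, D, subset, delta=1e-8):
    """J(A): local between-class scatter over local within-class scatter."""
    XA = np.asarray(X, dtype=float)[:, list(subset)]
    sq_dist = ((XA[:, None, :] - XA[None, :, :]) ** 2).sum(axis=2)
    return (sq_dist * D).sum() / ((sq_dist * S).sum() + delta)
```

The dense n-by-n distance array keeps the sketch short; for large n a sparse neighbour list would be the more economical choice.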
3.2 Iterative algorithm
Ignoring the relationship between features, FS calculates the score of all features independently. To remedy this defect, we measure a feature subset instead of a single feature and calculate its local Fisher score. To get an optimal feature subset, we need an iterative algorithm. In each iteration, we try to find an optimal feature from the remaining feature subset and add it to the previous optimal feature subset, where the generated new optimal feature subset has the highest local Fisher score.
Let G be the optimal feature subset in the previous iteration and its complementary set be \(\overline {G}=\{1,\cdots ,m\}-G\). In the current iteration, our mission is to pick a feature from \(\overline {G}\) and form a new subset G. As long as \(\overline {G}\) is not empty, we can generate candidates and evaluate them. Let \(A_{k} = G \cup \{k\}\) be a candidate, where \(k\in \overline {G}\). In the current iteration, we need to calculate \(J(A_{k})\) for all \(k\in \overline {G}\) and update the optimal feature subset G based on these scores. The update rule is
\[G \leftarrow G\cup \{k^{*}\}, \qquad k^{*}=\arg \max _{k\in \overline {G}} J(A_{k}). \qquad (5)\]
At the same time, we update \(\overline {G}=\{1,\cdots ,m\}-G\).
The feature subset G can also reflect feature ranking. The earlier the feature is selected, the more discriminant ability it has.
3.3 Algorithm analysis
First, we analyze the computational complexity of our algorithm. Assume that our task is to select r features from the original feature set. Then we stop the iterative algorithm once \(|G| \geq r\); in other words, we need r iterations. The computational complexity of each iteration depends not only on the number of samples but also on the numbers of selected and unselected features. Without loss of generality, let \(|G|=m^{\prime }\) in the current iteration. The computational complexity of calculating the within-class and between-class distances of one candidate feature subset is \(O(Kn(m^{\prime }+1))\), where K is the number of nearest neighbours, n is the number of samples, and \((m^{\prime }+1)\) is the size of the candidate subset. We have a total of \((m-m^{\prime })\) candidate feature subsets and need to select the one with the highest local Fisher score, so the time complexity of the iteration is \(O(Kn(m^{\prime }+1)(m-m^{\prime }))\). Over all r iterations, the per-iteration complexity is bounded above by \(O(Knm^{2}/4)\), which is reached when \(|G| = \lfloor m/2\rfloor \) or \(|G| = \lceil m/2\rceil \), where \(\lfloor \cdot \rfloor \) and \(\lceil \cdot \rceil \) round their argument to the nearest integer towards minus infinity and plus infinity, respectively. The highest per-iteration complexity is therefore related to the number of nearest neighbours, the number of samples, and the square of the number of features.
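To see where the factor \(m^{2}/4\) comes from, the per-iteration cost can be treated as a function of \(m^{\prime }\). The short derivation below is only a sketch; it relaxes \(m^{\prime }\) to a real number, which is our simplifying assumption:

\[f(m^{\prime })=(m^{\prime }+1)(m-m^{\prime }), \qquad f^{\prime }(m^{\prime })=m-2m^{\prime }-1=0 \;\Rightarrow \; m^{\prime }=\frac{m-1}{2}, \qquad f\left (\frac{m-1}{2}\right )=\frac{(m+1)^{2}}{4}.\]

Since \((m+1)^{2}/4=m^{2}/4+O(m)\), the leading term is \(m^{2}/4\), and the most expensive iterations are those in which about half of the features have already been selected.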
Next, we consider reducing the computational complexity in each iteration. In the current iteration, assume that we have a candidate \(A_{k} = G \cup \{k\}\), where \(k\in \overline {G}\). We can calculate the score of \(A_{k}\) by
\[J(A_{k})=\frac{{s^{G}_{b}}+s^{\{k\}}_{b}}{{s^{G}_{w}}+s^{\{k\}}_{w}} \qquad (6)\]
where \({s^{G}_{b}} = {\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left \|\mathbf {x}^{G}_{i} - \mathbf {x}^{G}_{j}\right \|^{2} D_{ij}\) and \({s^{G}_{w}} = {\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left \|\mathbf {x}^{G}_{i}-\mathbf {x}^{G}_{j}\right \|^{2} S_{ij}+\delta \) are the between-class and the within-class scatters in the previous iteration, respectively,
\[s^{\{k\}}_{b}={\sum }_{i=1}^{n}{\sum }_{j=1}^{n}(x_{ik}-x_{jk})^{2} D_{ij} \qquad (7)\]
and
\[s^{\{k\}}_{w}={\sum }_{i=1}^{n}{\sum }_{j=1}^{n}(x_{ik}-x_{jk})^{2} S_{ij}. \qquad (8)\]
In (6), both \({s^{G}_{b}}\) and \({s^{G}_{w}}\) can be obtained according to the previous iteration. Therefore, we can get the score of the candidate Ak by only calculating \(s^{\{k\}}_{b}\) and \(s^{\{k\}}_{w}\) by (7) and (8).
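The reason the scatters of a candidate can be assembled from previously computed quantities is that the squared Euclidean distance is additive over features. As a brief worked step (our own illustration, using the notation above), for \(A_{k}=G\cup \{k\}\) with \(k\notin G\):

\[\left \|\mathbf {x}^{A_{k}}_{i}-\mathbf {x}^{A_{k}}_{j}\right \|^{2}={\sum }_{l\in G}(x_{il}-x_{jl})^{2}+(x_{ik}-x_{jk})^{2}=\left \|\mathbf {x}^{G}_{i}-\mathbf {x}^{G}_{j}\right \|^{2}+(x_{ik}-x_{jk})^{2}.\]

Multiplying both sides by \(D_{ij}\) (or \(S_{ij}\)) and summing over i and j shows that the between-class (or within-class) scatter of \(A_{k}\) equals the scatter of G plus the scatter of the single feature k, with the constant δ counted once in the within-class term.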
Now, we analyze the computational complexity of the fast iterative algorithm. The computational complexity is O(Knm) for calculating the within-class and between-class distances of each feature before the iterative selection begins. In the current iteration, we then need a computational complexity of only \(O(m-m^{\prime })\), where \(|G|=m^{\prime }\). The detailed algorithm is described in Algorithm 1. The source code has been uploaded to GitHub (https://github.com/Min-Gan/ILFS).
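The fast iteration can be sketched as follows. This is a minimal NumPy illustration written for this text, not the released implementation: the function names are hypothetical, the matrices `S` and `D` are assumed to be given, and the dense (n, n, m) array in `per_feature_scatters` is used for brevity, so that function does not itself achieve the O(Knm) cost discussed above.

```python
import numpy as np

def per_feature_scatters(X, S, D):
    """Scatter of every single feature under the weights D and S."""
    X = np.asarray(X, dtype=float)
    diff_sq = (X[:, None, :] - X[None, :, :]) ** 2    # shape (n, n, m)
    s_b = np.einsum('ijk,ij->k', diff_sq, D)          # between-class, per feature
    s_w = np.einsum('ijk,ij->k', diff_sq, S)          # within-class, per feature
    return s_b, s_w

def ilfs_rank(s_b, s_w, n_select, delta=1e-8):
    """Greedy selection that reuses the running scatters of the subset G."""
    s_b = np.asarray(s_b, dtype=float)
    s_w = np.asarray(s_w, dtype=float)
    remaining = list(range(len(s_b)))
    selected, scores = [], []
    sb_G, sw_G = 0.0, delta            # scatters of the empty subset (delta counted once)
    while remaining and len(selected) < n_select:
        cand = np.array(remaining)
        J = (sb_G + s_b[cand]) / (sw_G + s_w[cand])   # scores of all candidates
        pos = int(np.argmax(J))
        k = remaining.pop(pos)
        selected.append(k)
        scores.append(float(J[pos]))
        sb_G += s_b[k]                 # update running scatters of G
        sw_G += s_w[k]
    return selected, scores
```

Each iteration only touches the precomputed per-feature scatters of the remaining candidates, which is what reduces the per-iteration cost to a term linear in the number of unselected features.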
Next, we analyze the properties of our algorithm. Let \(G^{(t)}\) be the optimal feature subset obtained in the t-th iteration. The following theorems describe the properties of ILFS.
Theorem 1
For the series \(\{J(G^{(t)})\}\) generated by ILFS, the following inequalities
\[J(G^{(t)})\geq J(G^{(t+1)}), \quad t=1,\cdots ,m-1\]
hold true.
Theorem 2
For the series \(\{J(G^{(t)})\}\) generated by ILFS, the following inequalities
\[J(G^{(t-1)})\geq J(\{k^{*}_{t}\}), \quad t=2,\cdots ,m\]
hold true, where \(k^{*}_{t}\) is the index of the optimal feature selected in the t-th iteration.
The proof of Theorem 1 is given in Appendix A. Theorem 1 states that the series \(\{J(G^{(t)})\}\) is monotonically decreasing. Theorem 2 can be directly derived from the proof of Theorem 1, so we omit its proof. Theorem 2 describes the relationship between the score of the optimal feature obtained in the current iteration and the score of the optimal feature subset obtained in the previous iteration.
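One way to see how the two statements are connected is the following elementary step. It is only a sketch under our own notation and is not the proof in Appendix A. Let a and b > 0 be the between-class and within-class scatters of the subset from the previous iteration, and let c and d be those of the newly selected feature, where d > 0 is assumed. Then

\[\frac{a+c}{b+d}\leq \frac{a}{b} \iff b(a+c)\leq a(b+d) \iff bc\leq ad \iff \frac{c}{d}\leq \frac{a}{b}.\]

Hence, whenever adding the feature does not increase the score of the subset, the ratio c/d of the added feature cannot exceed the previous score a/b. Since the score of the single feature is \(c/(d+\delta )\leq c/d\), it is bounded by the previous score as well.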
4 Experiments
This section tests the ability of our method to select discriminant features and compares our method with other popular supervised algorithms. Experiments are implemented in JetBrains PyCharm 2019.2.2 or in MATLAB R2016a on a PC with an Intel Core i5 processor and 16 GB RAM.
4.1 Experimental setting
Both KNN and SVM are used to classify data processed by feature selection algorithms. Compared methods include mRMR [24], Chi2 [20], WKNN [5], FS [12], GCMBO_FC [25] and our proposed method, which are described as follows:
- mRMR: The mRMR algorithm uses mutual information to select features. The purpose of mRMR is to select a feature subset that not only has the greatest correlation with the label information, but also makes the correlation between features least.
- Chi2: Chi2 is based on hypothesis testing, which is commonly used in probability theory and mathematical statistics. It determines whether the hypothesis is true by observing the deviation between the actual value and the theoretical value, and assigns a score to each feature accordingly.
- WKNN: WKNN uses the distance and attribute weighted K-nearest neighbours and takes gradient descent as an iterative optimization algorithm for finding the function minima. In experiments, we choose the Euclidean distance as the distance calculation method, the stochastic gradient descent method as the optimization method, the cross entropy loss as the loss function, and the exponential function as the distance evaluation function.
- FS: FS assumes that features with a strong discriminant ability should have a small within-class scatter and a large between-class scatter.
- GCMBO_FC: GCMBO_FC is a feature selection method based on the monarch butterfly optimization and Fisher criterion. This algorithm combines these two components by binarizing continuous solution vectors to select features from original ones.
- ILFS: ILFS is proposed in this paper. The constant δ is set to \(10^{-8}\) to avoid a zero denominator.
For mRMR, WKNN, FS and ILFS, we evaluate each feature and select the feature with the highest score to join the target feature subset. For both Chi2 and GCMBO_FC, we generate feature subsets of each specific size in turn for classification. We test the effect of these feature selection methods with KNN and SVM, respectively. For KNN, the parameter K is set to 5, which is the default value in the Python package sklearn. For SVM, the Gaussian radial basis function (RBF) kernel is used for all UCI data sets except KSD, for which the linear kernel is used. The hyperparameters of SVM are left at their default values in sklearn.
4.2 Experiments on UCI data sets
4.2.1 Data description
To test the performance of compared algorithms, we conduct experiments on 11 UCI data sets [10], including Soybeansmall, Wine, Sonar, Heart, Zoo, Vehicle, Australian, MEU-Mobile KSD (KSD), Spambase, and two other data sets with more than 10,000 samples named Crowdsourced Mapping (Map) [17] and Online Shoppers Purchasing Intention (Shopper) [27]. The data information is summarized in Table 1, where data sets are sorted according to the number of samples.
Because these data sets do not provide training and test sets independently, we need to divide each of them into the training and test sets randomly according to the ratio of 1 : 1, except for large ones (Map and Shopper) by 4 : 1. We repeat the operation of data division 100 times.
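The repeated random division can be sketched as below. This is an illustrative NumPy snippet with hypothetical function names; the use of a seeded generator and of unstratified permutations are our assumptions, since the exact splitting procedure is not specified here.

```python
import numpy as np

def random_split(n_samples, train_fraction, rng):
    """Return (train_idx, test_idx) for one random division."""
    perm = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return perm[:n_train], perm[n_train:]

def repeated_splits(n_samples, train_fraction=0.5, n_repeats=100, seed=0):
    """Generate n_repeats independent random train/test divisions.

    train_fraction=0.5 corresponds to the 1:1 ratio and 0.8 to the 4:1 ratio.
    """
    rng = np.random.default_rng(seed)   # assumption: fixed seed for repeatability
    return [random_split(n_samples, train_fraction, rng)
            for _ in range(n_repeats)]
```

Averaging a metric over the returned index pairs gives the mean and standard deviation over the 100 divisions.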
4.2.2 Analysis of property and parameter
First, we show the properties of our method provided by Theorems 1 and 2 on the Spambase dataset. The curve of score vs. iteration is plotted in Fig. 1, where J(G) is the score of the optimal feature subset in the (t − 1)-th iteration and J({k∗}) is the score of the optimal feature selected in the t-th iteration. We can see that the curve of J(G) shows a downward trend that supports the conclusion in Theorem 1, and J(G) is always greater than J({k∗}) that validates the conclusion of Theorem 2.
Second, we discuss the influences of the parameter K on the accuracy of our algorithm by observing the experimental results of Vehicle and Wine datasets, as shown in Fig. 2, where the parameter K varies in the set {1,2,3,5,7,10,15,20,30,60}. Note that FS and iterative FS (IFS) are taken as baselines. IFS is a variant method of ILFS, which follows the iterative process of ILFS and calculates the score of the feature subset A according to the following formulation:
From Fig. 2, we can see that the idea of iteration really improves FS because IFS and ILFS are much better than FS. In addition, ILFS is indeed sensitive to the value of K. When K is very large, the local structure of the original data is destroyed, which makes the accuracy of ILFS decrease and gradually approach that of IFS, which uses only the iterative idea. When K is very small, the score is calculated from training instances in a very small neighbourhood, which is prone to overfitting. The optimal value of K differs between data sets. For example, the accuracy of ILFS is the best when K = 5 for the Vehicle data set and when K = 15 for the Wine data set. Thus, it is better to choose K in the range from 5 to 20. In the following, the parameter K is selected from the set {5,10,15,20} through cross-validation.
4.2.3 Experimental results
We first rank all features using the compared algorithms and then record the average accuracy of feature subsets in ranking order by using KNN and SVM. Because WKNN takes hours to select features on large data sets, it is not compared on Map or Shopper here.
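The evaluation of feature subsets in ranking order can be illustrated with the sketch below. It is an assumption-laden stand-in rather than the experimental code: `knn_predict` is a small NumPy majority-vote classifier written for this example (the experiments use the KNN and SVM implementations of sklearn), and its tie-breaking rule may differ from that of sklearn.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    preds = []
    for row in y_train[nn]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])   # ties go to the smallest label
    return np.array(preds)

def accuracy_curve(X_train, y_train, X_test, y_test, ranking, k=5):
    """Accuracy obtained with the top-1, top-2, ... ranked features."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_test = np.asarray(y_test)
    accs = []
    for r in range(1, len(ranking) + 1):
        cols = list(ranking[:r])                  # first r features of the ranking
        pred = knn_predict(X_train[:, cols], y_train, X_test[:, cols], k)
        accs.append(float((pred == y_test).mean()))
    return accs
```

The position of the maximum of the returned list gives the subset size with the best accuracy for one train/test division; averaging the lists over all divisions yields an accuracy-versus-features curve.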
Using KNN, the curves of average accuracy vs. the number of features selected by each feature selection algorithm are shown in Fig. 3. For all compared algorithms, the discriminant ability initially increases with the number of ranked features, and a relatively good performance is achieved on most data sets, such as Soybeansmall, Wine, Sonar, Heart, Vehicle, Spambase and Map. On the Australian data set, Chi2, FS and ILFS need only the top two features to achieve their best performance, while the other algorithms require more features. On the Zoo data set, Chi2, mRMR and ILFS require only the top feature to obtain the best discriminant ability. According to Fig. 3, we list the best average classification accuracies and F1-scores with the corresponding standard deviations obtained by KNN in Tables 2 and 3, where the highest value among the six methods is in bold type. ILFS effectively improves the classification accuracy on all data sets, which shows its good performance. In particular, ILFS gets the best accuracy of 92.48% on the Wine data set, which is 19.12% higher than that of FS. In addition, the accuracy is increased by 8.59% on Zoo, 8.51% on Vehicle, 4.35% on KSD, 4.0% on Spambase and 3.95% on Heart.
The classification results obtained by SVM are listed in Tables 4 and 5, where ILFS also remains the best in eight out of 11 data sets. The accuracy rates are improved by 26.17% in Wine, 11.02% in Zoo, 6.36% in Vehicle, 6.28% in Spambase and 4.15% in Heart.
4.3 Experiments on the TEP data set
The Tennessee Eastman Process (TEP) data set is widely used to test control and troubleshooting models for complex industrial processes. It comes from the Tennessee Eastman simulation platform, an open chemical process simulation platform developed by Eastman Chemical Company on the basis of an actual chemical reaction process [9]. The TEP data set includes data collected under normal conditions and under 21 different fault states, as shown in Table 6, each of which has 52 features. All features are normalized to the interval [0,1].
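A common way to normalize features to [0,1] is min-max scaling, sketched below with NumPy. The function names are hypothetical, and fitting the minimum and range on the training set only is our assumption; under that choice, test values outside the training range can fall slightly outside [0,1].

```python
import numpy as np

def min_max_fit(X_train):
    """Per-feature minimum and range, estimated on the training set."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant features: avoid division by zero
    return lo, span

def min_max_apply(X, lo, span):
    """Scale features with previously fitted statistics."""
    return (np.asarray(X, dtype=float) - lo) / span
```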
In the normal data set, there are 500 samples for training and 960 ones for test. For each fault state, there is a training set composed of 480 fault data and a test set composed of 960 observation data where the first 160 samples are normal ones. We take 500 training data from a normal data set and 480 ones from a fault data set to form the training set and perform fault detection on the test set for each fault. Here, we consider Faults 10, 11, 14, 16, 17, and 19, where Faults 14 and 17 are relatively easy to be distinguished from the normal data, and the other four Faults are difficult according to Ref. [32].
The curves of accuracy obtained by KNN vs. the number of features on six faults from TEP are shown in Fig. 4. From all curves, we conclude that not all features are helpful for fault detection, which implies the importance of feature selection. On the basis of Fig. 4, we summarize the best fault detection accuracy and F1-score in Tables 7 and 8, respectively. Moreover, the sizes of the feature subsets with the highest accuracy obtained by KNN for all compared methods are listed in Table 9. In addition, the results obtained by SVM are listed in Tables 10, 11 and 12.
Through these tables, we can see that ILFS effectively completes the task of feature selection for faults and finds key features, which improves the fault detection rate while reducing the number of features. ILFS is superior to the other methods in both accuracy and F1-score on all six faults, whether the classifier is KNN or SVM. For KNN, the best detection accuracy obtained by ILFS for Faults 10, 11, 14, 16, 17 and 19 is higher than that of FS by 6.04%, 31.46%, 39.17%, 20.42%, 6.15% and 36.98%, respectively. For SVM, the accuracy is improved by 4.27%, 8.34%, 47.91%, 10.62%, 6.25% and 10.42% for the six faults. According to Tables 9 and 12, ILFS on the whole selects fewer features while achieving better performance.
Now, we compare the running time required to obtain a feature ranking using the feature selection methods. Both Chi2 and FS measure features by calculating a score for each feature without iteration, so they have a relatively short running time. Thus, we consider only the comparison of mRMR, WKNN, GCMBO_FC and ILFS, as listed in Table 13. We can see that ILFS is much faster than WKNN, slightly slower than mRMR, and comparable to GCMBO_FC.
ILFS takes an average of 3.743 seconds to complete the feature selection task, of which 2.808 seconds (75.02% of the total time) are used to obtain the scatter matrixes S and D. Within this calculation, 2.759 seconds (98.25%) are used to calculate the distances between samples. That is to say, if the scatter matrixes can be obtained quickly without calculating all distances, or the distance matrix can be obtained before feature selection, a lot of running time can be saved.
To further observe the distribution of selected features, we take Fault 14 as an example and show the top two features obtained by compared methods in Fig. 5. Observation on Fig. 5 indicates that the combination of Features 20 and 8 can be used to determine Fault 14, which is found by ILFS. However, it is very hard to use other combinations obtained by compared methods to distinguish Fault 14 and the normal data. Specially, observing the top two features obtained by FS shows that Fault 14 and normal samples are greatly overlapping, which results in a poor discrimination. Obviously, associating features for feature selection is more effective than separately considering features.
4.4 Experimental summary
Through the above two groups of experiments, we can see that, ignoring the adaptability of the structure of the data set to different classifiers, ILFS performs better when KNN is used as the classifier. The main reason is ILFS uses the idea of K nearest neighbor in the evaluation process. Even so, ILFS can also be applied to other classifiers, such as SVM.
Experimental results show that ILFS can effectively improve the performance of FS. Compared with the wrapper method WKNN, ILFS is much faster while maintaining the accuracy. Especially on data sets with few samples, ILFS avoids the overfitting caused by excessive dependence on the classification accuracy.
In the process of feature selection, the accuracy curve is relatively gentle and the descent speed is slow after reaching the optimal value owing to the consideration of the influence between features. It has a good fault tolerance when selecting the number of features in the subset. However, it is necessary for ILFS to find a better way to get the scatter matrixes S and D.
5 Conclusion
In this paper, we propose ILFS based on FS. ILFS can preserve the basic idea of FS that selects the feature with the greatest between-class scatter and the smallest within-class scatter. Unlike FS, ILFS calculates local between-class and within-class scatters by selecting K nearest neighbours in order to preserve the local structure of original data. ILFS iteratively generates an optimal feature subset by measuring the discriminant ability of candidate feature subsets. We conduct extensive experiments on UCI and TEP data sets. On the UCI data sets, we validate ILFS has a better ability to select discriminant features than compared methods including FS. On six faults from TEP, ILFS can select discriminant features for fault detection. All experimental results indicate that ILFS is much better than FS.
Although ILFS enhances FS by associating features, the computational complexity of ILFS is higher than FS. In future, we plan to decrease the computational complexity of ILFS without losing the ability to select discriminant features. In particular, we need to design a fast way to calculate the scatter matrixes S and D.
References
Abualigah LM, Khader AT, Hanandeh ES (2018) Hybrid clustering analysis using improved krill herd algorithm. Appl Intell 48(11):4047–4071
Abualigah LMQ (2019) Feature selection and enhanced krill herd algorithm for text document clustering. Springer, Berlin
Appice A, Ceci M, Rawles S, Flach P (2004) Redundant feature elimination for multi-class problems. In: Proceedings of the twenty-first international conference on machine learning, p 5
Bishop CM et al (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
Bugata P, Drotár P (2019) Weighted nearest neighbors feature selection. Knowl-Based Syst 163:749–761
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Elec Eng 40(1):16–28
Chen L, Man H, Nefian AV (2005) Face recognition based on multi-class mapping of Fisher scores. Pattern Recogn 38(6):799–811
Dixit M, Li Y, Vasconcelos N (2019) Semantic Fisher scores for task transfer: using objects to classify scenes. IEEE Trans Pattern Anal Mach Intell
Downs JJ, Vogel EF (1993) A plant-wide industrial process control problem. Comput Chem Eng 17(3):245–255
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Gu Q, Li Z, Han J (2012) Generalized Fisher score for feature selection. arXiv:1202.3725
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
He B, Shah S, Maung C, Arnold G, Wan G, Schweitzer H (2019) Heuristic search algorithm for dimensionality reduction optimally combining feature selection and feature extraction. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 2280–2287
He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Advances in neural information processing systems, pp 507–514
He X, Niyogi P (2004) Locality preserving projections. In: Advances in neural information processing systems, pp 153–160
Huang SH (2015) Supervised feature selection: a tutorial. Artif Intell Res 4(2):22–37
Johnson BA, Iizuka K (2016) Integrating OpenStreetMap crowdsourced data and Landsat time-series imagery for rapid land use/land cover (LULC) mapping: case study of the Laguna de Bay area of the Philippines. Appl Geogr 67:140–149
Keogh EJ, Mueen A (2010) Curse of dimensionality
Lai H, Tang Y, Luo H, Pan Y (2011) Greedy feature selection for ranking. In: Proceedings of the 2011 15th international conference on computer supported cooperative work in design (CSCWD). IEEE, pp 42–46
Liu H, Setiono R (1995) Chi2: feature selection and discretization of numeric attributes. In: Proceedings of 7th IEEE international conference on tools with artificial intelligence. IEEE, pp 388–391
Liu K, Yang X, Fujita H, Liu D, Yang X, Qian Y (2019) An efficient selector for multi-granularity attribute reduction. Inf Sci 505:457–472
Liu K, Yang X, Yu H, Mi J, Wang P, Chen X (2019) Rough set based semi-supervised feature selection via ensemble selector. Knowl-Based Syst 165:282–296
Moran M, Gordon G (2019) Curious feature selection. Inf Sci 485:42–54
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Qi X, Liu X, Boumaraf S (2019) A new feature selection method based on monarch butterfly optimization and Fisher criterion. In: 2019 international joint conference on neural networks (IJCNN). IEEE, pp 1–6
Robnik-Šikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn 53(1–2):23–69
Sakar CO, Polat SO, Katircioglu M, Kastro Y (2019) Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Comput Appl 31(10):6893–6908
Sheikhpour R, Sarram MA, Gharaghani S, Chahooki MAZ (2017) A survey on semi-supervised feature selection methods. Pattern Recogn 64:141–158
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2020) A review of unsupervised feature selection methods. Artif Intell Rev 53(2):907–948
Stańczyk U (2015) Feature evaluation by filter, wrapper, and embedded approaches. In: Feature selection for data and pattern recognition. Springer, pp 29–44
Ververidis D, Kotropoulos C (2005) Sequential forward feature selection with low computational cost. In: 2005 13th European signal processing conference. IEEE, pp 1–4
Xue Y, Zhang L, Wang B, Zhang Z, Li F (2018) Nonlinear feature selection using Gaussian kernel SVM-RFE for fault diagnosis. Appl Intell 48(10):3306–3331
Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Zhang R, Zhang Z (2020) Feature selection with symmetrical complementary coefficient for quantifying feature interactions. Appl Intell 50(1):101–118
Zhou H, Zhang Y, Zhang Y, Liu H (2019) Feature selection based on conditional mutual information: minimum conditional relevance and minimum conditional redundancy. Appl Intell 49(3):883–896
Acknowledgments
This work was supported in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 19KJA550002, by the Six Talent Peak Project of Jiangsu Province of China under Grant No. XYDXX-054, by the Priority Academic Program Development of Jiangsu Higher Education Institutions, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Appendix A: Proof of Theorem 1
Proof
We prove Theorem 1 by mathematical induction. First, we generate the series {J(G(t))}, t = 1,⋯ ,m using the rule (5).
For t = 1, G(0) = ∅ and \(\overline {G(0)}=\{1,\cdots ,m\}\). The highest score can be obtained by
where \(s_{b}^{G(0)}=0\) and \(s_{w}^{G(0)}=\delta \). In other words, J(G(1)) is generated by maximizing the local Fisher score of a single feature. Then \(G(1)=\{k_{1}^{*}\}\) according to (5), and \(\overline {G(1)}=\{1,\cdots ,m\}-G(1)\), where \(k_{1}^{*}\) is the optimal feature selected in the first iteration. From now on, \(k_{t}^{*}\) denotes the optimal feature selected in the t-th iteration.
For t = 2, J(G(2)) can be represented by
Naturally, the score of the feature subset \(\{k_{2}^{*}\}\) is less than or equal to that of the feature subset \(\{k_{1}^{*}\}\). Namely
By comparing (12) and (13), we have
Note that \(s_{w}^{G(1)}>0\) and \(s^{\{k_{2}^{*}\}}_{w}\geq 0\). Substituting (14) into (15), we have
which means that Theorem 1 holds in the first two iterations.
Now, we assume that Theorem 1 holds in the (t − 1)-th and t-th iterations. Namely,
For any t, J(G(t)) has the form:
The difference between J(G(t)) and J(G(t + 1)) can be described as
Because the denominator of (19) is greater than 0, we only need to consider the numerator of (19), which can be further represented as
Because (17) holds true, we have the following inequalities
where \(k,k^{*}_{t}\in \overline {G(t-1)}\) and \(k\neq k^{*}_{t}\). We separately derive the three inequalities and get
and
Because (24) holds true for \(\forall k\in \overline {G(t-1)}\) and \(k_{t+1}^{*} \in \overline {G(t)} \subset \overline {G(t-1)}\) in (20), we can rewrite (20) as
Substituting (25) into (19), we have
Consequently, the proof of Theorem 1 is complete by mathematical induction. □
Cite this article
Gan, M., Zhang, L. Iteratively local fisher score for feature selection. Appl Intell 51, 6167–6181 (2021). https://doi.org/10.1007/s10489-020-02141-0
DOI: https://doi.org/10.1007/s10489-020-02141-0