1 Introduction

High-dimensional data often degrades the performance of traditional algorithms because of the curse of dimensionality [18]. In addition to increasing the computational burden, high-dimensional data may cause overfitting when too many features are involved. Feature selection, an efficient dimensionality reduction technique, has been widely used to address this problem; it aims to select useful features and form an optimal feature subset with good discriminant ability for subsequent processing [6, 12]. How to measure the importance of features is the key issue in feature selection. Thus, every feature selection method requires a feature evaluation criterion, which largely decides its performance.

As a filter method, Fisher score (FS) is independent of learning algorithms and only uses a mathematical criterion to give each feature a score. According to these scores, the best features can be selected to form the target feature subset. FS has been widely used. For face recognition, Chen et al. [7] combined FS with traditional features such as log-likelihood and appearance-based features to form a new feature vector that takes full advantage of both local and holistic features of the human face. Dixit et al. [8] encoded the bag-of-semantics in the natural parameter space of the multinomial using a vector of Fisher scores to classify scenes for transfer tasks. In addition, Gu et al. [11] proposed a generalized Fisher score (GFS) that finds an optimal feature subset by maximizing the lower bound of the traditional Fisher score; the optimization problem of GFS can be solved by multivariate ridge regression and projected gradient descent. However, FS only considers the influence of a single feature on the within-class and between-class scatters and ignores possible relationships between features, such as combination or redundancy [3, 33]. Moreover, FS computes the scatters from a global view and may therefore fail to capture the real scatters.

To address the above issues of FS, this paper presents a novel feature selection method, called iteratively local Fisher score (ILFS). It is generally believed that the nearest samples are likely to share the same label, which means they have the greatest impact on classification. When evaluating a feature, it is therefore beneficial to consider the local structure of the data set. The goal of ILFS is to minimize the local within-class scatter and maximize the local between-class scatter. We use K nearest neighbours to represent the local distribution and use only the distances within each group of K nearest neighbours when computing the scatters. Meanwhile, ILFS measures the importance of a feature subset instead of a single feature as FS does. For the convenience of calculation, we adopt the idea of sequential forward selection (SFS): the feature selected in each iteration is the one that gives the current feature subset the greatest local Fisher score. In doing so, ILFS takes the relationship between features into account. Comparative experiments show that ILFS has an excellent performance.

The rest of this paper is organized as follows. In Section 2, we give a brief overview of feature selection and then introduce the traditional FS and SFS in detail. Section 3 presents the novel feature evaluation criterion ILFS. Section 4 gives experiments on the UCI and TEP data sets. Finally, conclusions are drawn in Section 5.

2 Related works

In this section, we first briefly introduce feature selection. Then, the related FS method is reviewed. Finally, we describe the scheme of sequential forward selection.

2.1 Feature selection

Feature selection is the process of removing redundant and unimportant features from the original feature set to obtain an optimal subset according to certain rules. As a dimension reduction technique, it has attracted more and more attention in fields such as machine learning and pattern recognition. Its advantages include deepening the understanding of features, reducing the amount of computation and improving the performance of classifiers.

On the basis of whether the class label information exists or not, methods for feature selection can be divided into three categories: supervised [16], unsupervised [29] and semi-supervised [28]. According to the relationship with learners, methods for feature selection can be also divided into three categories: Filter, Wrapper and Embedded [30].

Filter methods are independent of learning algorithms. After all features are evaluated, the important features are combined into feature subsets for subsequent classification tasks. For example, Fisher-based methods [4] try to find features with the smallest within-class distance and the largest between-class distance at the same time. Relief-based methods [26] maximize the difference between the nearest neighbours with the same label and with different labels of a given sample. In addition, filter methods also include max-relevance and min-redundancy (mRMR) [24], the Chi-square test (Chi2) [20] and others. Filter feature selection methods are simple and fast and can be easily extended to high-dimensional data. The evaluation results can be directly applied to various classification models. However, these methods may be weak in subsequent processing owing to the lack of interaction with learning models.

Wrapper methods are related to learning algorithms. In the process of wrapper feature selection, a classifier needs to be trained to measure a feature subset according to its classification performance. Usable classifiers include decision trees (DT), the K nearest neighbour classifier (KNN), Bayesian classifiers (Bayes), support vector machines (SVM) and so on. Bugata et al. [5] proposed a weighted K nearest neighbours (WKNN) method for feature selection, which weights each feature and modifies the weights until a satisfactory result is obtained. The final feature subset obtained by wrapper methods is more suitable for the adopted learning model. However, when the learning method changes, feature selection has to be carried out again. Moreover, wrapper feature selection methods consume more time and are more prone to overfitting than filter methods.

For embedded methods, the process of feature selection is embedded in the model training of a learning method and realized by optimizing an objective function. Compared with wrapper methods, embedded feature selection methods avoid repeatedly training the learning method when evaluating each feature subset and can quickly obtain the best one. However, this kind of method depends on parameter tuning to a great extent, so its performance is very sensitive to the parameters.

Some researchers improve the classical algorithms to achieve better results. For example, Zhou et al. [35] extended relevance and redundancy to conditional relevance and conditional redundancy and built a new frame integrating them to propose two methods: minimum conditional relevance-minimum conditional redundancy (MCRMCR) and minimum conditional relevance-minimum intra-class redundancy (MCRMICR). Zhang et al. [34] proposed the symmetrical complementary coefficient (SCC) to quantify features and interactions. On the basis of SCC, the ReliefF algorithm can be improved by screening out representative samples from the original data set. Researchers also tried to combine feature selection with other methods, such as rough set [21, 22], krill herd algorithm [1, 2], and Q-learning [23].

2.2 Fisher score

FS is one of the classical supervised feature selection methods, which seeks the optimal features that make the within-class scatter minimum and the between-class scatter maximum. Assume that there is a set of labeled samples \(\left \{(\mathbf {x}_{i}, y_{i}) \right \}_{i=1}^{n}\), where \(\mathbf {x}_{i}=[x_{i1}, x_{i2}, \cdots , x_{im}]^{T}\in \mathbb {R}^{m}\), yi ∈{1,2,⋯ ,c} is the label of xi indicating its category, m is the dimension of samples, and c is the number of classes. FS first assigns a score to each feature of the original data and then selects an optimal feature subset according to the scores. The score of the k-th feature can be computed independently as

$$ F(k)=\frac{{\sum}_{i=1}^{c} n_{i}({\mu_{k}^{i}}-\mu_{k})^{2}}{{\sum}_{i=1}^{c} n_{i}({\sigma_{k}^{i}})^{2}} $$
(1)

where ni is the number of samples in the i-th class, \({\mu _{k}^{i}}\) and \(({\sigma _{k}^{i}})^{2}\) are the mean and variance of the k-th feature in the i-th class, respectively, and μk is the mean of the k-th feature over all classes.

In (1), \({\sum }_{i=1}^{c} n_{i}({\mu _{k}^{i}}-\mu _{k})^{2}\) and \({\sum }_{i=1}^{c} n_{i}({\sigma _{k}^{i}})^{2}\) are the between-class scatter and the within-class scatter of the k-th feature, respectively. Thus, the score F(k) can represent the discriminant ability of the k-th feature. The greater the score F(k) is, the more discriminant ability the k-th feature has.
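For illustration, a minimal Python sketch of (1) follows; the function and variable names are ours, not from the authors' released code.

```python
import numpy as np

def fisher_score(X, y):
    """Score every feature by (1); X is an n x m array, y holds class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu = X.mean(axis=0)                      # global mean of each feature
    num = np.zeros(X.shape[1])               # between-class scatter per feature
    den = np.zeros(X.shape[1])               # within-class scatter per feature
    for c in np.unique(y):
        Xc = X[y == c]
        n_i = Xc.shape[0]
        num += n_i * (Xc.mean(axis=0) - mu) ** 2
        den += n_i * Xc.var(axis=0)          # n_i * (sigma_k^i)^2
    return num / np.maximum(den, 1e-12)      # guard against constant features
```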

Since FS assesses all features independently, it ignores the association between features. Assume that there are two features with high scores, which is to say these two features have good discriminant ability separately. However, it is uncertain whether they would improve the discriminant ability together. Similarly, it is also possible that two features with low scores achieve a good classification performance when they are combined. In addition, FS computes both the between-class scatter and the within-class scatter from a global view. The resulting scatters may not reflect the real ones if the data has a manifold structure.

2.3 Sequential forward selection

SFS is a heuristic search method that essentially adopts a greedy strategy [13, 19, 31]. The target feature subset starts from an empty set. In each iteration, a feature is selected from the candidates and added to the target feature subset so as to make the evaluation function optimal. The algorithm stops when the termination condition is reached, for example when all features are in the target feature subset.

As a simple greedy algorithm, SFS has a fast search speed, but it can only add features during the search process, which makes it easy to fall into a local optimum.
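To make the scheme concrete, here is a small sketch of generic SFS, assuming a user-supplied score(subset) function that evaluates a candidate feature subset (higher is better); the names are illustrative.

```python
def sequential_forward_selection(n_features, score, r):
    """Greedily grow a feature subset of size r, never removing features."""
    selected, remaining = [], set(range(n_features))
    while remaining and len(selected) < r:
        best = max(remaining, key=lambda k: score(selected + [k]))
        selected.append(best)        # add the candidate that maximizes the criterion
        remaining.remove(best)       # once added, a feature is never dropped
    return selected
```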

3 Iteratively local Fisher score

In this section, we propose a novel feature selection method, ILFS, which can address the shortcomings of FS mentioned above.

3.1 Local fisher score

It is unwise to construct a neighbourhood relationship from a global view if the data have a latent manifold structure. Here, we consider a local neighbourhood relationship constructed from the K nearest neighbours of a sample instead of all samples [14, 15]. The local neighbourhood relationship can be represented by two scatter matrices S and D, where S is the local within-class scatter matrix and D is the local between-class scatter matrix. The two matrices are defined as:

$$ S_{ij} = \left\{ \begin{array}{ll} 1, & \text{if}~ (\mathbf{x}_{j}\in KNN(\mathbf{x}_{i})\text{ or }\mathbf{x}_{i} \in KNN(\mathbf{x}_{j}))\text{ and }y_{i} = y_{j}\\ 0, & \text{otherwise} \end{array}\right. $$
(2)

and

$$ D_{ij} = \left\{ \begin{array}{ll} 1, & \text{if}~(\mathbf{x}_{j}\in KNN(\mathbf{x}_{i})\text{ or }\mathbf{x}_{i} \in KNN(\mathbf{x}_{j}))\text{ and } y_{i} \neq y_{j}\\ 0, & \text{otherwise} \end{array}\right. , $$
(3)

where KNN(xi) is the set of K nearest neighbours of xi.
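As one possible realization, the following sketch builds S and D with scikit-learn's NearestNeighbors; the symmetric KNN relation and variable names follow the definitions above, and this is not the authors' released implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_scatter_matrices(X, y, K=5):
    """Build the indicator matrices S (2) and D (3) from K nearest neighbours."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = X.shape[0]
    # K+1 neighbours because each training point is returned as its own neighbour
    idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(
        X, return_distance=False)[:, 1:]
    W = np.zeros((n, n), dtype=bool)
    W[np.repeat(np.arange(n), K), idx.ravel()] = True
    W |= W.T                                  # x_j in KNN(x_i) or x_i in KNN(x_j)
    same = y[:, None] == y[None, :]
    S = (W & same).astype(float)              # neighbours with the same label
    D = (W & ~same).astype(float)             # neighbours with different labels
    return S, D
```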

To consider the association between features, we define the local Fisher score of a feature subset. Without loss of generality, let A be the index set of features. Then, the local Fisher score of A is defined by

$$ J(A)=\frac{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\|\mathbf{x}^{A}_{i}-\mathbf{x}^{A}_{j}\|^{2} D_{ij}}{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\|\mathbf{x}^{A}_{i}-\mathbf{x}^{A}_{j}\|^{2} S_{ij}+\delta}, $$
(4)

where \(\mathbf {x}_{i}^{A} \in \mathbb {R}^{|A|}\) is a sub-vector of xi with features in A and δ is a small positive constant that is introduced to avoid a zero denominator.

In (4), the numerator and the denominator denote the local between-class and within-class scatters in the subspace A, respectively. Compared with (1), the local Fisher score focuses on the discriminant ability of a feature subset instead of a single feature. The higher the local Fisher score J(A) is, the more discriminant ability the feature subset A has. Moreover, J(A) contains the local information of the original data. Thus, the selected feature subset can maintain the local structure of the data well.
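A direct sketch of (4) for a feature index set A, given the matrices S and D from the previous sketch (delta plays the role of δ):

```python
import numpy as np

def local_fisher_score(X, A, S, D, delta=1e-8):
    """Local Fisher score (4) of the feature subset indexed by A."""
    XA = np.asarray(X, dtype=float)[:, list(A)]
    # pairwise squared Euclidean distances restricted to the features in A
    sq = ((XA[:, None, :] - XA[None, :, :]) ** 2).sum(axis=-1)
    return (sq * D).sum() / ((sq * S).sum() + delta)
```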

3.2 Iterative algorithm

Ignoring the relationship between features, FS calculates the score of each feature independently. To remedy this defect, we measure a feature subset instead of a single feature and calculate its local Fisher score. To obtain an optimal feature subset, we need an iterative algorithm: in each iteration, we find the optimal feature in the remaining feature set and add it to the previously selected subset so that the new feature subset has the highest local Fisher score.

Let G be the optimal feature subset in the previous iteration and its complementary set be \(\overline {G}=\{1,\cdots ,m\}-G\). In the current iteration, our mission is to pick up a feature from \(\overline {G}\) and form a new subset G. As long as \(\overline {G}\) is not empty, we can generate candidates and evaluate them. Let Ak = G ∪{k} be a candidate, where \(k\in \overline {G}\). In the current iteration, we need to calculate J(Ak) for \(\forall k\in \overline {G}\) and update the optimal feature subset G based on these scores. The update rule is

$$ \text{If}~k^{*}=\mathop{\arg\max}_{k\in\overline{G}} J(A_{k}),~\text{then}~ G \leftarrow G\cup\{k^{*}\}. $$
(5)

At the same time, we update \(\overline {G}=\{1,\cdots ,m\}-G\).

The feature subset G also reflects the feature ranking: the earlier a feature is selected, the more discriminant ability it has.
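The update rule (5) translates directly into the following naive loop; it reuses local_fisher_score from the previous sketch, while the fast variant in Section 3.3 avoids recomputing the scatters of G.

```python
def ilfs_naive(X, S, D, r, delta=1e-8):
    """Select r features by repeatedly applying the update rule (5)."""
    m = X.shape[1]
    G, G_bar = [], set(range(m))
    while G_bar and len(G) < r:
        k_star = max(G_bar, key=lambda k: local_fisher_score(X, G + [k], S, D, delta))
        G.append(k_star)             # features are returned in selection order,
        G_bar.remove(k_star)         # which also gives the feature ranking
    return G
```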

3.3 Algorithm analysis

First, we analyze the computational complexity of our algorithm. Assume that the task is to select r features from the original feature set. Then the iterative algorithm stops when \(|G| \geq r\); in other words, we need r iterations. The computational complexity of each iteration depends not only on the number of samples but also on the numbers of selected and unselected features. Without loss of generality, let \(|G|=m^{\prime }\) in the current iteration. The computational complexity of calculating the within-class and between-class distances of one candidate feature subset is \(O(Kn(m^{\prime }+1))\), where K is the number of nearest neighbours, n is the number of samples, and \((m^{\prime }+1)\) is the size of the candidate subset. There are \((m-m^{\prime })\) candidate feature subsets in total, from which the one with the highest local Fisher score is selected. Thus, the time complexity of one iteration is \(O(Kn(m^{\prime }+1)(m-m^{\prime }))\). Over all r iterations, the upper bound of the per-iteration complexity is \(O(Knm^{2}/4)\), reached when |G| = ⌊m/2⌋ or |G| = ⌈m/2⌉, where ⌊⋅⌋ and ⌈⋅⌉ round towards minus infinity and plus infinity, respectively. The highest per-iteration complexity is therefore related to the number of nearest neighbours, the number of samples, and the square of the number of features.

Next, we consider reducing the computational complexity in each iteration. In the current iteration, assume that we have a candidate Ak = G ∪{k}, where \(k\in \overline {G}\). We can calculate the score of Ak by:

$$ \begin{array}{@{}rcl@{}} \mathit{J}(\mathit{A}_{k})\!&=&\!\frac{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{A_{k}}_{i}-\mathbf{x}^{A_{k}}_{j}\right\|^{2} D_{ij}}{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{A_{k}}_{i}-\mathbf{x}^{A_{k}}_{j}\right\|^{2} S_{ij}+\delta}\\ \!&=&\! \frac{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{G}_{i} - \mathbf{x}^{G}_{j}\right\|^{2} D_{ij} + {\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{\{k\}}_{i} - \mathbf{x}^{\{k\}}_{j}\right\|^{2} D_{ij}}{{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{G}_{i} - \mathbf{x}^{G}_{j}\right\|^{2} S_{ij} + {\sum}_{i=1}^{n}{\sum}_{j=1}^{n}\left\|\mathbf{x}^{\{k\}}_{i} - \mathbf{x}^{\{k\}}_{j}\right\|^{2} S_{ij} + \delta}\\ \!&=&\! \frac{{s^{G}_{b}}+s^{\{k\}}_{b}}{{s^{G}_{w}}+s^{\{k\}}_{w}} \end{array} $$
(6)

where \({s^{G}_{b}} = {\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left \|\mathbf {x}^{G}_{i} - \mathbf {x}^{G}_{j}\right \|^{2} D_{ij}\) and \({s^{G}_{w}} = {\sum }_{i=1}^{n}\) \({\sum }_{j=1}^{n}\left \|\mathbf {x}^{G}_{i}-\mathbf {x}^{G}_{j}\right \|^{2} S_{ij}+\delta \) are the between-class and the within-class scatters in the previous iteration, respectively,

$$ s^{\{k\}}_{b}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\left\|\mathbf{x}^{\{k\}}_{i}-\mathbf{x}^{\{k\}}_{j}\right\|^{2} D_{ij} $$
(7)

and

$$ s^{\{k\}}_{w}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\left\|\mathbf{x}^{\{k\}}_{i}-\mathbf{x}^{\{k\}}_{j}\right\|^{2} S_{ij} $$
(8)

In (6), both \({s^{G}_{b}}\) and \({s^{G}_{w}}\) can be obtained according to the previous iteration. Therefore, we can get the score of the candidate Ak by only calculating \(s^{\{k\}}_{b}\) and \(s^{\{k\}}_{w}\) by (7) and (8).

Now, we analyze the computational complexity of the fast iterative algorithm. Before the iterative selection begins, calculating the within-class and between-class distances of every single feature costs O(Knm). Then, each iteration only needs a computational complexity of \(O(m-m^{\prime })\), where \(|G|=m^{\prime }\). The detailed algorithm is described in Algorithm 1. The source code has been uploaded to GitHub (https://github.com/Min-Gan/ILFS).

Algorithm 1 Iteratively local Fisher score (ILFS)
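A minimal sketch consistent with the fast procedure described above (the released code at the GitHub link above is authoritative): the per-feature scatters \(s^{\{k\}}_{b}\) and \(s^{\{k\}}_{w}\) in (7) and (8) are precomputed once, and each iteration combines them with the cached scatters of G as in (6).

```python
import numpy as np

def ilfs_fast(X, S, D, r, delta=1e-8):
    """Fast ILFS: cache the scatters of G and of every single feature."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    sb, sw = np.empty(m), np.empty(m)
    for k in range(m):                      # per-feature scatters, (7) and (8)
        sq = (X[:, k][:, None] - X[:, k][None, :]) ** 2
        sb[k], sw[k] = (sq * D).sum(), (sq * S).sum()
    G, G_bar = [], set(range(m))
    s_b_G, s_w_G = 0.0, delta               # cached scatters of G, delta included
    while G_bar and len(G) < r:
        # score of each candidate A_k = G + {k}, computed as in (6)
        k_star = max(G_bar, key=lambda k: (s_b_G + sb[k]) / (s_w_G + sw[k]))
        s_b_G, s_w_G = s_b_G + sb[k_star], s_w_G + sw[k_star]
        G.append(k_star)
        G_bar.remove(k_star)
    return G
```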

We analyze the properties of our algorithm. Let G(t) be the optimal feature subset obtained in the t-th iteration. The following theorems describe the properties of ILFS.

Theorem 1

For the series {J(G(t))} generated by ILFS, the following inequalities

$$ \begin{array}{@{}rcl@{}} J(G(t+1))\leq J(G(t)), t=1,\cdots,m-1 \end{array} $$
(9)

hold true.

Theorem 2

For the series {J(G(t))} generated by ILFS, the following inequalities

$$ \begin{array}{@{}rcl@{}} \frac{s^{\{k^{*}_{t}\}}_{b}}{s^{\{k^{*}_{t}\}}_{w}} \leq J(G(t)), t=1,\cdots,m-1 \end{array} $$
(10)

hold true, where \(k^{*}_{t}\) is the index of optimal feature selected in the t-th iteration.

The proof of Theorem 1 is given in Appendix A. Theorem 1 states that the series {J(G(t))} is monotonically decreasing. Theorem 2 can be directly derived from the proof of Theorem 1, so we omit its proof. Theorem 2 describes the relationship between the score of the optimal feature selected in the current iteration and the score of the optimal feature subset obtained in the previous iteration.

4 Experiments

This section tests the ability of our method to select discriminant features and compares it with other popular supervised algorithms. Experiments are implemented in JetBrains PyCharm 2019.2.2 or in MATLAB R2016a on a PC with an Intel Core i5 processor and 16GB RAM.

4.1 Experimental setting

Both KNN and SVM are used to classify the data processed by the feature selection algorithms. The compared methods include mRMR [24], Chi2 [20], WKNN [5], FS [12], GCMBO_FC [25] and our proposed method, which are described as follows:

mRMR::

The mRMR algorithm uses mutual information to select features. Its purpose is to select a feature subset that has the greatest relevance to the label information while keeping the redundancy between features minimal.

Chi2::

Chi2 applies the idea of hypothesis testing commonly used in probability theory and mathematical statistics: it determines whether a hypothesis holds by observing the deviation between the actual and theoretical values, and assigns a score to each feature accordingly.

WKNN::

WKNN uses distance- and attribute-weighted K nearest neighbours and takes gradient descent as the iterative optimization algorithm for minimizing the loss function. In the experiments, we choose the Euclidean distance as the distance measure, stochastic gradient descent as the optimization method, the cross-entropy loss as the loss function, and the exponential function as the distance evaluation function.

FS::

FS assumes that features with a strong discriminant ability should have a small within-class scatter and a large between-class scatter.

GCMBO_FC::

GCMBO_FC is a feature selection method based on the monarch butterfly optimization and the Fisher criterion. It combines these two components and binarizes continuous solution vectors to select features from the original ones.

ILFS::

ILFS is the method proposed in this paper. The constant \(\delta\) is set to \(10^{-8}\) to avoid a zero denominator.

For mRMR, WKNN, FS and ILFS, we evaluate each feature and select the feature with the highest score to join the target feature subset. For both Chi2 and GCMBO_FC, we generate feature subsets of specific sizes in turn for classification. We test the effect of these feature selection methods with KNN and SVM, respectively. The parameter K in the KNN classifier takes the default value of 5 in the sklearn package for Python. For SVM, the Gaussian radial basis function (RBF) kernel is used for all UCI datasets except the KSD dataset, for which the linear kernel is used. The hyperparameters of SVM are set to the defaults of the sklearn package.
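The evaluation protocol can be sketched as follows. This is a hedged illustration: rank_features stands for any of the compared selectors and is assumed to return feature indices in ranking order, and X and y are NumPy arrays.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_top_r(X, y, rank_features, r, seed=0):
    """Rank features on the training split and test KNN/SVM on the top-r features."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)   # 1:1 split
    top = list(rank_features(X_tr, y_tr))[:r]
    results = {}
    for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                      ("SVM", SVC(kernel="rbf"))]:
        clf.fit(X_tr[:, top], y_tr)
        results[name] = accuracy_score(y_te, clf.predict(X_te[:, top]))
    return results
```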

4.2 Experiments on UCI data sets

4.2.1 Data description

To test the performance of compared algorithms, we conduct experiments on 11 UCI data sets [10], including Soybeansmall, Wine, Sonar, Heart, Zoo, Vehicle, Australian, MEU-Mobile KSD (KSD), Spambase, and two other data sets with more than 10,000 samples named Crowdsourced Mapping (Map) [17] and Online Shoppers Purchasing Intention (Shopper) [27]. The data information is summarized in Table 1, where data sets are sorted according to the number of samples.

Table 1 Information of 11 UCI data sets

Because these data sets do not provide independent training and test sets, we randomly divide each of them into training and test sets with a ratio of 1 : 1, except for the large ones (Map and Shopper), which are split 4 : 1. The random division is repeated 100 times.

4.2.2 Analysis of property and parameter

First, we show the properties of our method stated in Theorems 1 and 2 on the Spambase dataset. The curve of score vs. iteration is plotted in Fig. 1, where J(G) is the score of the optimal feature subset in the (t − 1)-th iteration and J({k}) is the score of the optimal feature selected in the t-th iteration. We can see that the curve of J(G) shows a downward trend, which supports the conclusion of Theorem 1, and that J(G) is always greater than J({k}), which validates the conclusion of Theorem 2.

Fig. 1 Curves of score vs. iteration on the Spambase dataset

Second, we discuss the influence of the parameter K on the accuracy of our algorithm by observing the experimental results on the Vehicle and Wine datasets, as shown in Fig. 2, where K varies in the set {1,2,3,5,7,10,15,20,30,60}. Note that FS and iterative FS (IFS) are taken as baselines. IFS is a variant of ILFS, which follows the iterative process of ILFS but calculates the score of a feature subset A according to the following formulation:

$$ F(A)=\frac{{\sum}_{i=1}^{c} n_{i}{\sum}_{k\in A}({\mu}_{k}^{i}-\mu_{k})^{2}}{{\sum}_{i=1}^{c} n_{i}{\sum}_{k\in A}({\sigma_{k}^{i}})^{2}} $$
(11)
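For reference, a small sketch of the IFS score (11), which simply sums the class-wise scatters of FS over the features in A (names are illustrative):

```python
import numpy as np

def ifs_score(X, y, A):
    """IFS score (11) of the feature subset indexed by A."""
    XA, y = np.asarray(X, dtype=float)[:, list(A)], np.asarray(y)
    mu = XA.mean(axis=0)
    num = den = 0.0
    for c in np.unique(y):
        Xc = XA[y == c]
        num += Xc.shape[0] * ((Xc.mean(axis=0) - mu) ** 2).sum()
        den += Xc.shape[0] * Xc.var(axis=0).sum()
    return num / den
```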
Fig. 2 Accuracy vs. ranked feature with different parameter K on a Vehicle and b Wine datasets

From Fig. 2, we can see that the idea of iteration really improves FS, because IFS and ILFS are much better than FS. In addition, ILFS is indeed sensitive to the value of K. When K is very large, the local structure of the original data is ruined, so the accuracy of ILFS decreases and gradually approaches that of IFS, which only uses the iterative idea. When K is very small, the score is computed from training instances in a very small neighbourhood, which is prone to overfitting. The optimal value of K differs between datasets; for example, the accuracy of ILFS is best when K = 5 for the Vehicle dataset and when K = 15 for the Wine dataset. Thus, a reasonable range for K is from 5 to 20. In the following, K is selected from the set {5,10,15,20} through cross-validation.

Fig. 3 Curves of accuracy obtained by KNN vs. features on 11 UCI data sets: a Soybeansmall, b Wine, c Sonar, d Heart, e Zoo, f Vehicle, g Australian, h MEU-Mobile KSD (KSD), i Spambase, j Map, and k Shopper

4.2.3 Experimental results

We first rank all features using the compared algorithms and then record the average accuracy of the feature subsets in ranking order by using KNN and SVM. Because WKNN takes hours to select features on the large data sets, no comparison with it is made on Map and Shopper.

The curves of average accuracy obtained by KNN vs. the number of ranked features are shown in Fig. 3. At the beginning, all compared algorithms improve their discriminant ability as the number of ranked features increases and achieve relatively good performance on most data sets, such as Soybeansmall, Wine, Sonar, Heart, Vehicle, Spambase and Map. On the Australian data set, Chi2, FS and ILFS need only the top two features to achieve their best performance, while the other algorithms require more features. On the Zoo data set, Chi2, mRMR and ILFS require only the top one feature to obtain the best discriminant ability. According to Fig. 3, we list the best average classification accuracies and F1-scores with the corresponding standard deviations obtained by KNN in Tables 2 and 3, where the highest value among the six methods is shown in bold. ILFS effectively improves the classification accuracy on all data sets, which shows its good performance. In particular, ILFS achieves the best accuracy of 92.48% on the Wine data set, which is 19.12% higher than FS. In addition, the accuracy is increased by 8.59% on Zoo, 8.51% on Vehicle, 4.35% on KSD, 4.0% on Spambase and 3.95% on Heart.

Table 2 Mean and standard deviations of accuracy (%) obtained by KNN on UCI data sets
Table 3 Mean and standard deviations of F1-score (%) obtained by KNN on UCI data sets

The classification results obtained by SVM are listed in Tables 4 and 5, where ILFS remains the best on eight out of 11 data sets. The accuracy is improved by 26.17% on Wine, 11.02% on Zoo, 6.36% on Vehicle, 6.28% on Spambase and 4.15% on Heart.

Table 4 Mean and standard deviations of accuracy (%) obtained by SVM on UCI data sets
Table 5 Mean and standard deviations of F1-score (%) obtained by SVM on UCI data sets

4.3 Experiments on the TEP data set

The Tennessee Eastman Process (TEP) data set is widely used to test control and fault diagnosis models for complex industrial processes. It comes from the Tennessee Eastman simulation platform, an open chemical process simulation platform developed by Eastman Chemical Company based on an actual chemical reaction process [9]. The TEP data set includes data collected under normal conditions and 21 different fault states, as shown in Table 6, each of which has 52 features. All features are normalized to the interval [0,1].

Table 6 Faults for TEP

In the normal data set, there are 500 samples for training and 960 for test. For each fault state, there is a training set composed of 480 fault samples and a test set composed of 960 observations, of which the first 160 are normal. For each fault, we take the 500 training samples of the normal data set and the 480 training samples of the fault data set to form the training set, and perform fault detection on the corresponding test set. Here, we consider Faults 10, 11, 14, 16, 17 and 19, where Faults 14 and 17 are relatively easy to distinguish from the normal data, and the other four faults are difficult according to Ref. [32].
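To make this protocol concrete, here is a hedged sketch of how one fault-detection task is assembled from the arrays described above; the variable names are ours, and loading and [0,1] normalization of the raw TEP files are omitted.

```python
import numpy as np

def build_fault_task(normal_train, fault_train, fault_test):
    """Form the binary training/test sets for one fault (0 = normal, 1 = fault)."""
    X_train = np.vstack([normal_train, fault_train])          # 500 + 480 samples
    y_train = np.hstack([np.zeros(len(normal_train)),
                         np.ones(len(fault_train))])
    # each fault test set has 960 observations, the first 160 of which are normal
    y_test = np.hstack([np.zeros(160), np.ones(len(fault_test) - 160)])
    return X_train, y_train, fault_test, y_test
```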

The curves of accuracy obtained by KNN vs. features on six faults from TEP are shown in Fig. 4. From all the curves, we can conclude that not all features are helpful for fault detection, which implies the importance of feature selection. On the basis of Fig. 4, we summarize the best fault detection accuracy and F1-score in Tables 7 and 8, respectively. Moreover, the numbers of features in the optimal subsets with the highest accuracy obtained by KNN for all compared methods are listed in Table 9. In addition, the results obtained by SVM are listed in Tables 10, 11 and 12.

Fig. 4 Curves of accuracy obtained by KNN vs. features on six faults from the TEP data set: a Fault 10, b Fault 11, c Fault 14, d Fault 16, e Fault 17, and f Fault 19

Table 7 Best fault detection accuracy (%) obtained by KNN
Table 8 Best fault detection F1-score (%) obtained by KNN
Table 9 Number of the optimal feature subset with the highest accuracy obtained by KNN
Table 10 Best fault detection accuracy (%) obtained by SVM
Table 11 Best fault detection F1-score (%) obtained by SVM
Table 12 Number of the optimal feature subset with the highest accuracy obtained by SVM

From these tables, we can see that ILFS effectively completes the task of feature selection for the faults and finds key features, which improves the fault detection rate while reducing the number of features. ILFS is superior to the other methods in both accuracy and F1-score on all six faults, whether the classifier is KNN or SVM. For KNN, the best detection accuracies obtained by ILFS for Faults 10, 11, 14, 16, 17 and 19 are increased by 6.04%, 31.46%, 39.17%, 20.42%, 6.15% and 36.98% compared with FS, respectively. For SVM, the accuracies are improved by 4.27%, 8.34%, 47.91%, 10.62%, 6.25% and 10.42% on the six faults. According to Tables 9 and 12, ILFS as a whole selects fewer features while achieving better performance.

Now, we compare the running time required by the feature selection methods to obtain the feature ranking. Both Chi2 and FS measure features by calculating a score for each feature without iteration, so they have a relatively short running time. Thus, we only compare mRMR, WKNN, GCMBO_FC and ILFS, as listed in Table 13. We can see that ILFS is much faster than WKNN, slightly slower than mRMR, and comparable to GCMBO_FC.

Table 13 Running time (s) of the feature selection algorithms to complete the six fault detection tasks

ILFS takes an average of 3.743 seconds to complete the feature selection task, of which 2.808 seconds are used to obtain the scatter matrices S and D, accounting for 75.02% of the total time. Within this step, 2.759 seconds are spent calculating the distances between samples, accounting for 98.25%. That is to say, if the scatter matrices could be obtained without calculating all pairwise distances, or if the distance matrix could be obtained before feature selection, a lot of running time could be saved.

To further observe the distribution of the selected features, we take Fault 14 as an example and show the top two features obtained by the compared methods in Fig. 5. Figure 5 indicates that the combination of Features 20 and 8, found by ILFS, can be used to determine Fault 14. However, it is very hard to use the other combinations obtained by the compared methods to distinguish Fault 14 from the normal data. In particular, the top two features obtained by FS show that Fault 14 and the normal samples greatly overlap, which results in poor discrimination. Obviously, evaluating features jointly is more effective than considering them separately.

Fig. 5 Top two features ranked by different feature selection methods for Fault 14: a mRMR, b Chi2, c WKNN, d FS, e GCMBO_FC, and f ILFS

4.4 Experimental summary

Through the above two groups of experiments, we can see that, setting aside the adaptability of the data structure to different classifiers, ILFS performs better when KNN is used as the classifier. The main reason is that ILFS uses the idea of K nearest neighbours in the evaluation process. Even so, ILFS can also be applied to other classifiers, such as SVM.

Experimental results show that ILFS can effectively improve the performance of FS. Compared with WKNN, ILFS is much faster while maintaining the accuracy. Especially on small-sample data sets, ILFS avoids the overfitting issue caused by over-dependence on the classification accuracy.

Because ILFS considers the interaction between features, the accuracy curve is relatively flat and declines slowly after reaching its optimum, which provides good tolerance when choosing the number of features in the subset. However, ILFS still needs a better way to obtain the scatter matrices S and D.

5 Conclusion

In this paper, we propose ILFS based on FS. ILFS preserves the basic idea of FS, which selects the features with the greatest between-class scatter and the smallest within-class scatter. Unlike FS, ILFS calculates local between-class and within-class scatters over K nearest neighbours in order to preserve the local structure of the original data. ILFS iteratively generates an optimal feature subset by measuring the discriminant ability of candidate feature subsets. We conduct extensive experiments on UCI and TEP data sets. On the UCI data sets, we validate that ILFS has a better ability to select discriminant features than the compared methods, including FS. On six faults from TEP, ILFS can select discriminant features for fault detection. All experimental results indicate that ILFS is much better than FS.

Although ILFS enhances FS by associating features, its computational complexity is higher than that of FS. In the future, we plan to decrease the computational complexity of ILFS without losing the ability to select discriminant features. In particular, we need to design a fast way to calculate the scatter matrices S and D.