High-dimensional supervised feature selection via optimized kernel mutual information
Introduction
Feature selection is an important issue in expert and intelligent system technology; it uses both the input and output variables to uncover the relationships between features and class labels. Prediction models have been used for various expert and intelligent systems applications, including multi-agent systems, knowledge management, neural networks, knowledge discovery, data and text mining, multimedia mining, and genetic algorithms. These models generally involve a number of features. However, not all of these features are equally important for a specific task: some may be redundant or even irrelevant, and better performance may be achieved by discarding them.
Many applications in pattern recognition require the identification of the most characteristic features of a given data set D that contains N samples and M features, with $M \gg N$ in the high-dimensional case. The data set D usually includes a large amount of irrelevant, redundant, and unnecessary information, which degrades recognition performance. A feature-selection method (Cheriet, Kharma, Liu, & Suen, 2007) selects a subset G from the M features (|G| < M) such that D is optimized based on G according to a criterion J. The goal is to maximize the predictive accuracy of the data within D and minimize the cost of extracting the features within G.
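In symbols, writing $F$ for the full feature set, the selection task described above can be stated as the following generic optimization (a standard formulation, not specific to any one method):

$$ G^{*} = \operatorname*{arg\,max}_{G \subseteq F,\; |G| < M} J(G), $$

where $J$ scores a candidate subset, for example by predictive accuracy penalized by the cost of extracting the features in $G$.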
Feature selection focuses on finding the best subspace, where the total number of subspaces of the original data set D is $2^{M}$. Given a number k (k < M), the number of subspaces with dimension not larger than k is $\sum_{i=1}^{k} \binom{M}{i}$. Thus, for a large feature number M, D is high dimensional, and exhaustively searching the space of feature subsets is difficult. To address this issue, sequential-search-based methods for selecting features have been proposed. Blum and Langley (1997) grouped feature-selection methods into three types: filter, wrapper, and embedded. Filter methods (Almuallim & Dietterich, 1994; Kira & Rendell, 1992) provide quick estimates of the value of features and filter out the irrelevant or redundant features before they are fed into the classifier. In contrast, wrapper methods (Kohavi & John, 1997) usually interact with a classifier, so the classifier performance directly affects the quality of the feature subsets. Finally, in embedded methods (Lal, Chapelle, Weston, & Elisseeff, 2006), feature selection is embedded into the classifier, so the two are not independent and execute simultaneously.
Additionally, feature-selection methods can be categorized as unsupervised, supervised, or semi-supervised. Unsupervised methods are developed without using class labels and include joint embedding learning and sparse regression (JELSR) (Hou, Nie, Li, Yi, & Wu, 2014), matrix factorization (MF) (Wang, Pedrycz, Zhu, & Zhu, 2015), k-nearest-neighbor selection (Chan & Kim, 2015), feature similarity feature selection (FSFS) (Mitra, Murthy, & Pal, 2002), the Laplacian score (LS) (He, Cai, & Niyogi, 2005), and regularized self-representation (Zhu, Zuo, Zhang, Hu, & Shiu, 2015), all of which offer efficient algorithms for unsupervised feature selection. Supervised feature-selection methods search for features of an input vector by predicting the class label; existing methods include ReliefF (Kira & Rendell, 1992), the Fisher score, correlation, kernel optimization (Kopt) (Xiong, Swamy, & Ahmad, 2005), kernel class separability (KCS) (Wang, 2008), generalized multiple kernel learning (GMKL) (Varma & Babu, 2009), scaled class separability selection (SCSS) (Ramona, Richard, & David, 2012), spectral feature selection with minimum redundancy (MRSF) (Zhao, Wang, & Liu, 2010), the Hilbert–Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005), the HSIC-based greedy feature-selection criterion (Song, Smola, Gretton, Bedo, & Borgwardt, 2012), sparse additive models (SpAM) (Ravikumar, Lafferty, Liu, & Wasserman, 2009), Hilbert–Schmidt feature selection (HSFS) (Masaeli, Fung, & Dy, 2010), centered kernel target alignment (cKTA) (Cortes, Mohri, & Rostamizadeh, 2014), and the feature-wise kernelized Lasso (HSIC Lasso) (Yamada, Jitkrittum, Sigal, Xing, & Sugiyama, 2014). The method proposed herein and the methods we compare it with are all supervised.
Mutual information (MI) is widely used to construct decision trees for ranking variables and also serves as a metric for feature selection (Eriksson, Kim, Kang, & Lee, 2005). MI methods based on Shannon information rank features information-theoretically, with the dependency between two variables serving as the metric and entropy representing the relationship between an observed variable x and an output result y. The MI of x and y is defined through their probability density functions p(x) and p(y) and their joint probability density function p(x, y) (Battiti, 1994). Regarding feature ranking, some works report that the union of individually good features does not necessarily lead to good recognition performance; in other words, "the m best features are not the best m features." Although MI can decrease the redundancy with respect to the original features and select the best features with minimal redundancy, as the joint probability of the features and the target class increases, the redundancy among the features may decrease (Pudil, Novovičová, & Kittler, 1994).
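For reference, the Shannon MI referred to above is defined, for discrete x and y, as

$$ I(x;y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = H(y) - H(y \mid x), $$

with the sums replaced by integrals over the densities for continuous variables; $I(x;y) = 0$ exactly when x and y are independent.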
To select good features by using the statistical dependency distribution, Peng, Long, and Ding (2005) proposed the minimal-redundancy-maximal-relevance (mRMR) criterion, a feature-selection method based on MI that combines maximal dependency, maximal relevance, and minimal redundancy. The selected features have the maximal joint dependency on the target class, which is called "maximal dependency," but this is hard to implement directly, so maximal relevance approximates it by the MI between each feature and the target class. Minimal redundancy counteracts the redundancy that maximal relevance alone would admit, so a redundancy metric is computed from the MI among the selected features. In experiments, mRMR improved the classification accuracy on several data sets. For more details on mRMR, see Ding and Peng (2005) and Zhao et al. (2010).
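Concretely, for a selected subset S and target class c, the relevance and redundancy terms of Peng et al. (2005) are

$$ D(S,c) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c), \qquad R(S) = \frac{1}{|S|^{2}} \sum_{x_i, x_j \in S} I(x_i; x_j), $$

and mRMR selects features by maximizing the combined criterion $\Phi = D - R$, typically via incremental (greedy) search.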
The definition of MI is based on the entropy of the features and the class label; however, it favors features with many values. Some features are very simple, with integer values in a small range, whereas other feature values are floating-point numbers with very wide ranges, which require more computation to obtain a ratio that reflects the correlation between features and class. Another problem is inconsistency (Dash & Liu, 2003): consider n samples that share the same feature values, of which m1 belong to class 1 and, in general, mi belong to class i. If the largest of these counts is m1, the inconsistency of this pattern is n − m1, and the inconsistency rate is the sum of the inconsistencies over all patterns divided by the size N of the data set. Dash and Liu (2003) show that the time complexity of computing the inconsistency rate is close to O(N); the rate is also monotonic and tolerant to noise. However, it is defined only for discrete values, so continuous features must be discretized, which seriously increases the computational complexity and consumes more memory. Occasionally, the computation is interrupted because the feature number is too large for the available memory. This problem is discussed in detail below.
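To make the inconsistency rate concrete, the following is a minimal sketch of the computation just described, assuming already-discretized integer features; the function name and toy data are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

import numpy as np

def inconsistency_rate(X, y):
    """Inconsistency rate in the sense of Dash & Liu (2003): group samples
    that share an identical (discrete) feature pattern; each group's
    inconsistency is its size minus the count of its majority class.
    The rate is the summed inconsistency divided by the number of samples N."""
    groups = defaultdict(list)
    for pattern, label in zip(map(tuple, X), y):
        groups[pattern].append(label)
    total = sum(len(labels) - max(Counter(labels).values())
                for labels in groups.values())
    return total / len(y)

# Toy example: the two samples with pattern (0, 1) disagree on the label,
# so one of the three samples is inconsistent and the rate is 1/3.
X = np.array([[0, 1], [0, 1], [1, 0]])
y = np.array([0, 1, 0])
print(inconsistency_rate(X, y))  # 0.333...
```

Note that the grouping is a single pass over the samples, consistent with the near-O(N) complexity cited above; the cost that hurts in practice is the prior discretization of continuous features.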
To resolve these drawbacks of the MI method, kernel-based methods (Gretton, Herbrich, & Smola, 2003; Lin, Ying, Chen, & Lee, 2008; Sakai & Sugiyama, 2014) have been introduced to enhance MI; the Hilbert–Schmidt independence criterion (HSIC), which uses kernel-based independence measures, is introduced in Gretton et al. (2005) and Song et al. (2012). These approaches are popular for mapping the data to a nonlinear high-dimensional space (Alzate & Suykens, 2012; Schölkopf, Smola, & Müller, 1998). Multi-kernel learning with a sparse representation on a manifold (Wang, Bensmail, & Gao, 2014) has been applied to feature selection and can handle noisy features and nonlinear data. Kernel-based feature-selection methods integrate a linear combination of features into the criterion. Real applications of kernels must settle the kernel type and its parameters; cross-validation may optimize the kernel, but it consumes more time and is prone to over-fitting.
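As an illustration of such kernel-based independence measures, the biased empirical HSIC estimator of Gretton et al. (2005), $\mathrm{HSIC} = (n-1)^{-2}\,\mathrm{tr}(KHLH)$, can be sketched in a few lines of NumPy; the function names and the fixed Gaussian bandwidth are illustrative choices, not the paper's settings:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian Gram matrix K with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: (n-1)^{-2} * trace(K H L H), where
    H = I - (1/n) 11^T centers the Gram matrices. Larger values indicate
    stronger (possibly non-linear) dependence between X and Y."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(hsic(x, x**2))                        # dependent pair: clearly positive
print(hsic(x, rng.normal(size=(200, 1))))   # independent pair: near zero
```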
Traditional feature-selection methods (Kira & Rendell, 1992) assume a linear dependency between input features and output values, so they cannot capture non-linear dependency. KCS (Wang, 2008) and cKTA (Cortes et al., 2014) are not necessarily positive definite, and thus their objective functions can be non-convex. Furthermore, in the kernel-based methods (Gretton, Herbrich, & Smola, 2003; Varma & Babu, 2009; Xiong, Swamy, & Ahmad, 2005), the output y must be transformed by the non-linear kernel function ϕ(·), which strongly limits the flexibility of capturing non-linear dependency; an advantage of this formulation, however, is that the globally optimal solution can be computed efficiently, making it scalable to high-dimensional feature-selection problems. Finally, the output y must be a real number in SpAM (Ravikumar et al., 2009), meaning that SpAM cannot deal with structured outputs such as multi-label and graph data. Greedy search strategies such as forward selection and backward elimination are used in mRMR (Peng et al., 2005) and HSIC (Gretton et al., 2005), but greedy approaches tend to produce a locally optimal feature set. To the best of our knowledge, a convex feature-selection method is able to deal with high-dimensional, non-linearly related features. In addition, HSIC Lasso (Yamada et al., 2014) uses the output Gram matrix L to select features and can thus naturally incorporate structured outputs via kernels. All of these feature-selection methods are summarized in Table 1.
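For concreteness, the HSIC Lasso of Yamada et al. (2014) mentioned above selects features by solving the convex problem

$$ \min_{\alpha \in \mathbb{R}^{M}} \; \frac{1}{2}\Big\| \bar{L} - \sum_{k=1}^{M} \alpha_{k} \bar{K}^{(k)} \Big\|_{F}^{2} + \lambda \|\alpha\|_{1} \quad \text{s.t. } \alpha_{1}, \dots, \alpha_{M} \ge 0, $$

where $\bar{K}^{(k)}$ is the centered Gram matrix computed from the k-th feature alone, $\bar{L}$ is the centered output Gram matrix, and the features with non-zero $\alpha_{k}$ are retained.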
To address these problems, we propose herein an approach that combines the strengths of kernel functions and the MI method to obtain a high-dimensional supervised feature-selection framework, called optimized kernel mutual information (OKMI), with joint kernel learning, maximum relevance, and minimum redundancy. Instead of using MI to characterize high-dimensional data through the feature and class probabilities, we embed a kernel function into the MI to form a new framework. Widely used kernel functions, including the polynomial, Gaussian, exponential, and sigmoid kernels, can be seen as special cases within the OKMI framework. After repeated comparisons and analyses, we choose a suitable kernel to satisfy the requirements of a given problem. Comparisons with related high-dimensional supervised feature-selection methods show that OKMI can be integrated with other methods to rapidly find a compact subset of features. Experimental results reveal that OKMI improves classification accuracy over a wide range of data sets.
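For reference, the four kernel families named above have the following textbook forms; the parameter values here are illustrative defaults, not the settings used in the experiments:

```python
import numpy as np

# Textbook forms of the four kernel families named above, for vectors x, z.
def polynomial(x, z, c=1.0, d=2):
    """Polynomial kernel (x'z + c)^d."""
    return (np.dot(x, z) + c) ** d

def gaussian(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def exponential(x, z, sigma=1.0):
    """Exponential kernel: like the Gaussian, but with unsquared distance."""
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

def sigmoid(x, z, a=1.0, b=0.0):
    """Sigmoid kernel tanh(a x'z + b); not positive definite for all a, b."""
    return np.tanh(a * np.dot(x, z) + b)
```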
The motivation of this work can be summarized as follows:
- (1) Theoretical analysis. The OKMI framework integrates kernel learning and MI. Although both have been used for feature selection, no advanced theoretical analysis of their combination explains why it selects optimal features. Therefore, the first step in this paper is to analyze OKMI theoretically by using an optimization method.
- (2) Computational complexity. Traditional MI methods compute probability distributions to implement feature selection. The range of the feature values strongly affects these probabilities, and occasionally the size of the feature space and the discretization of the data are so costly that programs with high computational complexity crash. The OKMI method avoids this problem by finding the optimal features at very low cost.
- (3) Classification accuracy. Any theoretical analysis must be verified experimentally. We therefore run various experiments to show that OKMI is an efficient method for supervised feature selection, comparing it with other methods in comprehensive experiments that involve different kernels, classifiers, and types of data sets. The results show that OKMI with the proposed algorithm is effective and robust over a wide range of real feature-selection applications.
The remainder of this paper is organized as follows: Section 2 introduces background and previous work related to supervised feature selection, together with theoretical analyses of kernels and MI. Section 3 presents the OKMI method and its feature-selection algorithm, which consists of schemes to select the optimal squeezed features. Section 4 discusses implementation issues involving several kernels and classifiers. The results of experiments on various types of data sets, including gene, handwritten-digit, image, and microarray real-world data sets, are described in Section 5. Discussion and conclusions are presented in Section 6. The theoretical analysis and experiments described herein focus on supervised feature selection.
Using optimized kernel mutual information to select features
The MI method is an important way to learn the mapping of a large number of input features to output class labels. In this section, we review the related methods and present the OKMI criterion for feature selection. We analyze the OKMI method theoretically and explain why it is suitable.
Implementation of optimized kernel mutual information feature-selection algorithm
This section describes the design of an iterative update algorithm based on the above criterion to select an optimal subset of features. In the previous section, we proposed a theoretical OKMI method that left many issues unresolved, including how to determine the candidate subset from the feature set and how to choose a suitable kernel to enhance performance. However, according to the OKMI method, the redundant features sometimes need to be removed from the selected-feature subset; a…
Discussion on kernel and classifier
In this section, we discuss the choice of kernel and classifier for use in experiments. The kernel is chosen to meet the requirements, and the experiments are analyzed to determine how best to use various classifiers.
Experiments
In this section, we test the proposed OKMI method by selecting features from eight public data sets. The data sets differ in their features and samples, and the number of features in some data sets is large. Other data sets have few features but a large number of samples, so the ranges of the feature values differ greatly. Thus, the various data sets present a greater challenge to the proposed method: they require a highly accurate and robust OKMI method under complex experimental…
Conclusions
In this paper, we propose an OKMI approach to select features by using a kernel function and mutual information. The OKMI method improves the selection and avoids the complex computation of the joint probability density. We report the results of experiments on eight data sets, some with over 10 000 features. The average rate of correct classification is greater than that produced by the compared methods. The OKMI method integrates kernel function, classifier, and mutual…
Acknowledgments
This work was supported by the Guangdong Provincial Government of China through the "Computational Science Innovative Research Team" program and the Guangdong Province Key Laboratory of Computational Science at Sun Yat-Sen University, the Technology Program of Guangdong (Grant no. 2012B091100334), the National Natural Science Foundation of China (Grant no. 11471012), and the China Scholarship Council (Grant no. 201506385010).
References (41)
- Almuallim & Dietterich (1994). Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence.
- Alzate & Suykens (2012). Hierarchical kernel spectral clustering. Neural Networks.
- Battiti (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks.
- Blum & Langley (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence.
- Chan & Kim (2015). Sequential random k-nearest neighbor feature selection for high-dimensional data. Expert Systems with Applications.
- Chapelle, Vapnik, Bousquet, & Mukherjee (2002). Choosing multiple parameters for support vector machines. Machine Learning.
- Cheriet, Kharma, Liu, & Suen (2007). Character recognition systems: A guide for students and practitioners.
- Cortes, Mohri, & Rostamizadeh (2014). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research.
- Cover & Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
- Dash & Liu (2003). Consistency-based search in feature selection. Artificial Intelligence.
- Ding & Peng (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology.
- Eriksson, Kim, Kang, & Lee (2005). An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Processing Letters.
- Gretton, Bousquet, Smola, & Schölkopf (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In Proceedings of the international conference on algorithmic learning theory.
- Gretton, Herbrich, & Smola (2003). The kernel mutual information. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing.
- Kohavi & John (1997). Wrappers for feature subset selection. Artificial Intelligence.
- Lin, Ying, Chen, & Lee (2008). Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications.
- Pudil, Novovičová, & Kittler (1994). Floating search methods in feature selection. Pattern Recognition Letters.
- Wang, Bensmail, & Gao (2014). Feature selection and multi-kernel learning for sparse representation on a manifold. Neural Networks.
- Wang, Pedrycz, Zhu, & Zhu (2015). Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognition.
- Zhu, Zuo, Zhang, Hu, & Shiu (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition.