Elsevier

Neurocomputing

Volume 321, 10 December 2018, Pages 237-250

On the role of sparsity in feature selection and an innovative method LRMI

https://doi.org/10.1016/j.neucom.2018.09.017

Abstract

Feature selection is used in many applications in machine learning and bioinformatics. As a popular approach, feature selection can be implemented in the filter-manner based on the sparse solution of the l1 regularization. Most studies of the l1 regularization concentrate on the iterative solution of the problem or on adapting sparsity to different applications. It is necessary to explore more deeply how the sparsity learned with the l1 regularization contributes to feature selection. In this paper, we analyze the role of the l1 regularization in feature selection from the perspective of information theory. We discover that the l1 regularization contributes to minimizing the redundancy in feature selection. To avoid the complex computation of the l1 optimization, we propose a novel feature selection algorithm, i.e. the Laplacian regularization based on mutual information (LRMI), which realizes the minimization of redundancy in a new way and incorporates the l2 norm to achieve automatic grouping. Extensive experimental results demonstrate the superiority of LRMI over several traditional l1-regularization-based feature selection algorithms, with lower time consumption.

Introduction

In computer vision, machine learning and data mining, objects are usually represented as high-dimensional feature vectors. In various applications, high-dimensional data consume an enormous amount of computing time and storage space. Moreover, most existing machine learning methods, for classification, regression or other tasks, are better suited to low-dimensional data; computation on high-dimensional data can become much more complex and challenging. As a solution to this issue, feature selection (also known as variable selection) [1], [2], [3], [4] is performed to choose a representative subset of the high-dimensional features. The subset is expected to eliminate redundancy while retaining sufficient information about the original high-dimensional feature set for specific learning tasks.

According to different evaluation metrics, feature selection algorithms can be roughly classified into three categories, i.e., filter-manner, wrapper, and embedded methods [4]. Filter-manner methods, such as Variance or Fisher score [5], [6], rely on general characteristics of the data to evaluate and select feature subsets without involving the learning algorithm. Wrapper methods take the performance of the learning algorithm as the evaluation criterion; in other words, their purpose is to select the feature subset most suitable for the learning task. For embedded methods, feature selection and the training of the learning model are integrated and accomplished simultaneously.
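As a concrete illustration of the filter manner, the sketch below (not taken from the paper; names and data are illustrative) scores each feature by its variance and keeps the top-k, independently of any learning algorithm:

    # Minimal filter-manner sketch: rank features by variance, keep the top-k.
    import numpy as np

    def variance_filter(X, k):
        """Return indices of the k features with the largest variance."""
        scores = X.var(axis=0)                # one variance score per feature
        return np.argsort(scores)[::-1][:k]   # indices of the top-k scores

    X = np.random.rand(100, 20)               # toy data: 100 samples, 20 features
    selected = variance_filter(X, k=5)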

The theory of sparse representation [7], [8] and compressed sensing [9] has been integrated into embedded feature selection. Such feature selection methods have been widely used in face authentication and detection [10], [11], face attribute classification [12], mass spectrometry data classification [13] and gene expression analysis [14]. They usually resort to sparse solutions based on regularization techniques such as the l1 norm constraint or the l1 penalty term. Hence, sparse representation is also referred to as the l1 regularization, the l1 penalty or the l1 norm in some cases.

The theoretical understanding of these methods is mostly focused on the sparsity ensured by the l1 regularization, with attention on two major aspects. One is the various solutions to the l1 regularization; the other is the multiple applications of sparsity-inducing feature selection. The mechanism of the sparsity learned with the l1 regularization is not fully revealed in feature selection. What role does the l1 regularization play in the process of feature selection? Is it really necessary to adopt the l1 regularization to select feature subsets? By understanding the real merit of the l1 regularization in feature selection, could more efficient methods be developed?

After solving the objective function containing the l1 regularization, a coefficient vector is obtained whose entries serve as sparsity values for the features. The larger the value is, the more important the corresponding feature is presumed to be. However, in the validation experiments of our previous work [15], [16], we discovered that features corresponding to smaller weights can also obtain good results in classification tasks. It is very attractive to explore the mechanism behind such phenomena, which seem to contradict the presumption underlying sparsity-inducing feature selection.
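To make this presumption concrete, the following hedged sketch (using scikit-learn's Lasso rather than the authors' code; data and hyperparameters are illustrative) ranks features by the magnitude of the learned sparse coefficients:

    # Fit an l1-regularized model and rank features by |coefficient|.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.rand(200, 50)                                # toy data
    y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200)   # two informative features

    coef = Lasso(alpha=0.05).fit(X, y).coef_
    ranking = np.argsort(np.abs(coef))[::-1]             # largest |weight| first
    top_features = ranking[:10]                          # presumed "most important" subset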

Inspired by this observation, we reevaluate the role of sparsity in feature selection from the perspective of information theory. We analyze the variable dependency among features selected with the l1 regularization sparsity and discover that the l1 regularization serves to minimize redundancy among the selected features. It is known that obtaining sparsity with the l1 regularization is very time-consuming; the solution often resorts to either an NP-hard problem or an alternative problem that involves a costly iterative optimization [17]. For this reason, we propose a novel feature selection method, i.e. the Laplacian regularization based on mutual information (LRMI), to address this problem from two aspects. First, mutual information is adopted to measure feature correlation and realize redundancy minimization. Second, we integrate the essential structure of the features to identify important variables by introducing the l2 norm into the objective function. Besides its special group effect [18], the l2 norm leads to a closed-form solution and avoids the time-consuming optimization of the l1 regularization. In addition, unlike other mutual-information-based feature selection methods [19], [20], [21], which work in the filter manner, LRMI realizes feature selection in the embedded manner. To evaluate our approach, we conduct extensive experiments on several public data sets; the experimental results demonstrate the effectiveness of our method. We also develop a parallel realization of the proposed model to improve its time efficiency.
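Since the exact LRMI objective is not reproduced in this preview, the following is only an assumption-laden sketch of the idea described above: pairwise mutual information among features is assembled into a Laplacian-style redundancy regularizer, and the l2 term keeps the solution in closed form (the function name and the formulation itself are illustrative, not the authors' implementation):

    # Hedged sketch: Laplacian-style redundancy regularizer built from pairwise
    # mutual information, combined with an l2 term for a closed-form solution.
    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    def lrmi_like_weights(X, y, alpha=1.0, beta=1.0):
        n, d = X.shape
        # Pairwise mutual information among features (symmetrized, zero diagonal).
        M = np.column_stack([mutual_info_regression(X, X[:, j]) for j in range(d)])
        M = 0.5 * (M + M.T)
        np.fill_diagonal(M, 0.0)
        L = np.diag(M.sum(axis=1)) - M                   # graph-Laplacian-style matrix
        # Closed form: w = (X^T X + alpha*L + beta*I)^{-1} X^T y
        A = X.T @ X + alpha * L + beta * np.eye(d)
        return np.linalg.solve(A, X.T @ y)

Features can then be ranked by the magnitude of the returned weights, analogously to the l1-based ranking above.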

As an extension to our previous work [16], the main contributions of this paper are as follows:

  • We use information theory to evaluate the role of the l1 regularization in feature selection and discover that sparsity intrinsically acts as a minimal-redundancy term in feature selection.

  • We propose a new feature selection method based on mutual information instead of the l1 regularization, which not only realizes minimum redundancy among features but also utilizes the structural information of the features. The new method has a closed-form solution and avoids the time-consuming iterative optimization of the l1 regularization.

  • Through experiments on several public data sets, we demonstrate the superiority of the approach over several mainstream l1 regularization algorithms.

  • We develop a parallel computing realization of the proposed model. The training time of our method is only 50 percent of that of the traditional methods.

The remainder of this paper is organized as follows. Section 2 describes the related work. In Section 3, we discuss the role of the l1 regularization sparsity in feature selection. We present the novel feature selection model in Section 4. The experimental results are summarized in Section 5. We draw the conclusions in Section 6.

Section snippets

Related work

The proposed work is inspired by a detailed analysis of the l1 regularization based feature selection. In this section, we first introduce several classical feature selection algorithms based on the l1 regularization, which are used in experiments as references. We also introduce the work on variable dependency analysis with mutual information.

The role of l1 regularization in feature selection

To understand the role of the l1 regularization in feature selection, we divide the discussion into three subsections. Firstly, we introduce the sparsity obtained with the l1 regularization through the general solution process of a loss function with the l1 norm. Secondly, we illustrate that the features not selected by the basic sparsity routine also contain abundant information for discriminant tasks. Finally, we provide a theoretical explanation about the role of sparsity in feature
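For reference, the general loss function with the l1 norm referred to in this snippet commonly takes the Lasso form (a standard example, not necessarily the paper's exact notation):

    \min_{w \in \mathbb{R}^{d}} \; \frac{1}{2}\, \| y - Xw \|_2^2 + \lambda\, \| w \|_1

where the non-zero entries of the learned w indicate the selected features and λ controls the degree of sparsity.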

LRMI method

It is known that the computation of the l1 regularization typically requires solving either an NP-hard problem or an alternative problem involving costly iterative optimization [17]. In Section 3, we reveal that the l1 regularization has the property of keeping minimum redundancy among the selected features. There are many ways to achieve redundancy minimization. We propose a novel feature selection model, LRMI, based on mutual information. LRMI uses the mutual information to minimize the
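The snippet above is truncated; a plausible form of such an objective, written here only as an assumption consistent with the description (the paper's actual equation is not shown in this preview), is

    \min_{w} \; \| y - Xw \|_2^2 + \alpha\, w^{\top} L_{\mathrm{MI}}\, w + \beta\, \| w \|_2^2, \qquad L_{\mathrm{MI}} = D - M

where M holds pairwise mutual information between features, D is its degree (row-sum) matrix, the quadratic term penalizes large weights on mutually redundant features, and the l2 term keeps the solution in closed form.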

Experimental evaluation

The performance of LRMI is compared with several other regularization based embedded methods for feature selection, i.e. Lasso (l1 regularization) [28], Ridge (l2 regularization) [31], Fused Lasso [30], Elastic Net [18] and Group Lasso [32]. As reviewed in Section 2.2.1, these methods are chosen for two reasons. First, they are all suitable for embedded feature selection based on the l1 or l2 regularization. Second, they are reported to achieve good performance in the literature. After all, the
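The sketch below (hyperparameters and data are illustrative; Fused Lasso and Group Lasso are not part of scikit-learn and would need separate implementations) shows how several of the baselines named above might be run for comparison:

    # Hedged baseline comparison with scikit-learn regularized linear models.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge, ElasticNet
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(300, 100)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(300)       # toy regression target

    baselines = {
        "Lasso (l1)":  Lasso(alpha=0.05),
        "Ridge (l2)":  Ridge(alpha=1.0),
        "Elastic Net": ElasticNet(alpha=0.05, l1_ratio=0.5),
    }
    for name, model in baselines.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())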

Conclusions

In this paper, we illustrate the role of the l1 regularization in feature selection from the perspective of information theory. We prove that the l1 regularization has the effect of minimizing the redundancy of the selected feature subset. Since l1-regularization-based feature selection is very time-consuming and neglects the group effect among features, we propose a novel feature selection method, LRMI. We replace the l1 regularization with the mutual information to measure the correlation or

Acknowledgments

The work is funded by the National Natural Science Foundation of China (Nos. 61773375, 61170155).


References (53)

  • H. Cheng et al., Sparse representation and learning in visual recognition: theory and applications, Signal Process. (2013)
  • N.X. Vinh et al., Can high-order dependencies improve mutual information based feature selection?, Pattern Recognit. (2016)
  • A.-C. Haury et al., The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS ONE (2011)
  • C.C. Aggarwal, Feature selection for classification: a review, Data Classification (2014)
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • I. Guyon, Pattern classification, Pattern Anal. Appl. (1998)
  • Q. Gu, Z. Li, J. Han, Generalized Fisher score for feature selection (2012), arXiv preprint...
  • J. Wright et al., Sparse representation for computer vision and pattern recognition, Proc. IEEE (2010)
  • M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing (2010)
  • A. Destrero et al., A sparsity-enforcing method for learning face features, IEEE Trans. Image Process. (2009)
  • A. Destrero et al., A regularized framework for feature selection in face detection and authentication, Int. J. Comput. Vis. (2009)
  • Y. Fang et al., Multi-instance feature learning based on sparse representation for facial expression recognition, Proceedings of the International Conference on Multimedia Modeling (2015)
  • F. Nie et al., Efficient and robust feature selection via joint ℓ2,1-norms minimization, Proceedings of the 23rd International Conference on Neural Information Processing Systems (2010)
  • X. Hang et al., Sparse representation for classification of tumors using gene expression data, J. Biomed. Biotechnol. (2009)
  • C. Yu et al., Sparse representation for feature selection in face demographic classification, J. Comput. Inf. Syst. (2014)
  • Q. Yuan et al., A novel automatic grouping algorithm for feature selection, CCF Chinese Conference on Computer Vision (2017)
  • D.L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory (2006)
  • H. Zou et al., Regularization and variable selection via the elastic net, J. R. Stat. Soc. (2005)
  • H.H. Yang et al., Data visualization and feature selection: new algorithms for non-Gaussian data, Adv. Neural Inf. Process. Syst. (2000)
  • H. Peng et al., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. (2005)
  • G. Brown et al., Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. (2012)
  • E.J. Candès, Compressive sampling, Proceedings of the International Congress of Mathematicians (2006)
  • D.L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. (2007)
  • A.M. Bruckstein et al., From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. (2009)
  • A.Y. Ng, Feature selection, L1 vs. L2 regularization, and rotational invariance, Proceedings of the Twenty-First International Conference on Machine Learning (2004)

    Yuchun Fang, Associate Professor. She gained her Ph.D. from the Institute of Automation, Chinese Academy of Sciences in 2003. From 2003 to 2004, she worked as a post-doctoral researcher at the France National Research Institute on Information and Automation (INRIA). Since 2005, she has worked at the School of Computer Engineering and Sciences, Shanghai University. She is a member of IEEE, ACM, and CCF (Chinese Computer Federation). Her current research interests include multimedia, pattern recognition, machine learning and image processing.

    Qiulong Yuan, received B.S. degree in Ningxia University, Ningxia, China, in 2015. He is now a graduate student at Shanghai University, China. His research interests include machine learning and pattern recognition.

    Zhaoxiang Zhang, received the B.S. degree in electronic science and technology from the University of Science and Technology of China, Hefei, China, in 2004 and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009. In 2009, he joined the School of Computer Science and Engineering, Beihang University, Beijing, as an Assistant Professor, from 2009 to 2011, an Associate Professor, from 2012 to 2015, and the Vice-Director of the Department of Computer Application Technology, from 2014 to 2015. In 2015, he returned to the Institute of Automation, Chinese Academy of Sciences, as a Full Professor. His current research interests include computer vision, pattern recognition, machine learning, and brain-inspired neural networks and brain-inspired learning. Dr. Zhang is an Associate Editor or a Guest Editor of several international journals, such as Neurocomputing, Pattern Recognition Letters, and IEEE Access.
