Elsevier

Neurocomputing

Volume 321, 10 December 2018, Pages 237-250

On the role of sparsity in feature selection and an innovative method LRMI

https://doi.org/10.1016/j.neucom.2018.09.017

Abstract

Feature selection is used in many applications in machine learning and bioinformatics. As a popular approach, feature selection can be implemented in the filter-manner based on the sparse solution of the l1 regularization. Most studies of the l1 regularization concentrate on the iterative solution of the problem or on adapting sparsity to different applications. It is necessary to explore more deeply how the sparsity learned with the l1 regularization contributes to feature selection. In this paper, we analyze the role of the l1 regularization in feature selection from the perspective of information theory. We discover that the l1 regularization contributes to minimizing the redundancy in feature selection. To avoid the complex computation of the l1 optimization, we propose a novel feature selection algorithm, i.e. the Laplacian regularization based on mutual information (LRMI), which realizes the minimization of redundancy in a new way and incorporates the l2 norm to achieve automatic grouping. Extensive experimental results demonstrate the superiority of LRMI over several traditional l1-regularization-based feature selection algorithms, with lower time consumption.

Introduction

In computer vision, machine learning and data mining, objects are usually represented as high-dimensional feature vectors. In various applications, high-dimensional data consume an enormous amount of computing time and storage space. Moreover, most existing machine learning methods, for classification, regression or other tasks, are better suited to low-dimensional data; computation on high-dimensional data can become much more complex and challenging. As a solution to this issue, feature selection (also known as variable selection) [1], [2], [3], [4] is performed to choose a representative subset of the high-dimensional features. The subset is expected to eliminate redundancy while retaining sufficient information about the original high-dimensional feature set for specific learning tasks.

According to different evaluation metrics, feature selection algorithms can be roughly classified into three categories, i.e., filter-manner, wrapper, and embedded methods [4]. Filter-manner methods, such as Variance or Fisher score [5], [6], rely on general characteristics of the data to evaluate and select feature subsets without involving the learning algorithm. Wrapper methods take the performance of the learning algorithm as the evaluation criterion; in other words, their purpose is to select the feature subset most suitable for the learning task. For embedded methods, feature selection and the training of the learning model are integrated and accomplished simultaneously.
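As a concrete illustration of the filter manner, the sketch below (not taken from the paper; names and data are illustrative) scores each feature by its variance and keeps the top-k, independently of any learning algorithm:

    # Minimal filter-manner sketch: rank features by variance, keep the top-k.
    import numpy as np

    def variance_filter(X, k):
        """Return indices of the k features with the largest variance."""
        scores = X.var(axis=0)                # one variance score per feature
        return np.argsort(scores)[::-1][:k]   # indices of the top-k scores

    X = np.random.rand(100, 20)               # toy data: 100 samples, 20 features
    selected = variance_filter(X, k=5)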

The theory of sparse representation [7], [8] and compressed sensing [9] has been integrated into embedded feature selection. Such feature selection methods have been widely used in face authentication and detection [10], [11], face attribute classification [12], mass spectrometry data classification [13] and gene expression analysis [14]. They usually resort to sparse solutions based on regularization techniques such as the l1 norm constraint or the l1 penalty term. Hence, sparse representation is also referred to as the l1 regularization, the l1 penalty or the l1 norm in some cases.

The theoretical understanding of these methods is mostly focused on the sparsity ensured by the l1 regularization, with attention on two major aspects. One is the various solutions to the l1 regularization; the other is the multiple applications of sparsity-inducing feature selection. The mechanism of the sparsity learned with the l1 regularization is not fully revealed in feature selection. What role does the l1 regularization play in the process of feature selection? Is it really necessary to adopt the l1 regularization to select feature subsets? By understanding the real merit of the l1 regularization in feature selection, could more efficient methods be developed?

After solving the objective function containing the l1 regularization, a coefficient vector is obtained whose entries serve as sparsity values for the features. The larger the value is, the more important the corresponding feature is presumed to be. However, in the validation experiments of our previous work [15], [16], we discovered that features corresponding to smaller weights can also obtain good results in classification tasks. It is very attractive to explore the mechanism behind such phenomena, which seem to contradict the presumption underlying sparsity-inducing feature selection.
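To make this presumption concrete, the following hedged sketch (using scikit-learn's Lasso rather than the authors' code; data and hyperparameters are illustrative) ranks features by the magnitude of the learned sparse coefficients:

    # Fit an l1-regularized model and rank features by |coefficient|.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.rand(200, 50)                                # toy data
    y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200)   # two informative features

    coef = Lasso(alpha=0.05).fit(X, y).coef_
    ranking = np.argsort(np.abs(coef))[::-1]             # largest |weight| first
    top_features = ranking[:10]                          # presumed "most important" subset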

Inspired by this observation, we reevaluate the role of sparsity in feature selection from the perspective of information theory. We analyze the variable dependency among features selected with the l1 regularization sparsity and discover that the l1 regularization serves to minimize redundancy among the selected features. It is known that obtaining sparsity with the l1 regularization is very time-consuming; the solution often resorts to either an NP-hard problem or an alternative problem that involves a costly iterative optimization [17]. For this reason, we propose a novel feature selection method, i.e. the Laplacian regularization based on mutual information (LRMI), to address this problem from two aspects. First, mutual information is adopted to measure feature correlation and realize redundancy minimization. Second, we integrate the essential structure of the features to identify important variables by introducing the l2 norm into the objective function. Besides its special group effect [18], the l2 norm leads to a closed-form solution and avoids the time-consuming optimization of the l1 regularization. In addition, unlike other mutual-information-based feature selection methods [19], [20], [21], which work in the filter manner, LRMI realizes feature selection in the embedded manner. To evaluate our approach, we conduct extensive experiments on several public data sets; the experimental results demonstrate the effectiveness of our method. We also develop a parallel realization of the proposed model to improve its time efficiency.
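Since the exact LRMI objective is not reproduced in this preview, the following is only an assumption-laden sketch of the idea described above: pairwise mutual information among features is assembled into a Laplacian-style redundancy regularizer, and the l2 term keeps the solution in closed form (the function name and the formulation itself are illustrative, not the authors' implementation):

    # Hedged sketch: Laplacian-style redundancy regularizer built from pairwise
    # mutual information, combined with an l2 term for a closed-form solution.
    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    def lrmi_like_weights(X, y, alpha=1.0, beta=1.0):
        n, d = X.shape
        # Pairwise mutual information among features (symmetrized, zero diagonal).
        M = np.column_stack([mutual_info_regression(X, X[:, j]) for j in range(d)])
        M = 0.5 * (M + M.T)
        np.fill_diagonal(M, 0.0)
        L = np.diag(M.sum(axis=1)) - M                   # graph-Laplacian-style matrix
        # Closed form: w = (X^T X + alpha*L + beta*I)^{-1} X^T y
        A = X.T @ X + alpha * L + beta * np.eye(d)
        return np.linalg.solve(A, X.T @ y)

Features can then be ranked by the magnitude of the returned weights, analogously to the l1-based ranking above.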

As an extension to our previous work [16], the main contributions of this paper are as follows:

  • We use information theory to evaluate the role of the l1 regularization in feature selection and discover that sparsity intrinsically acts as a minimal-redundancy term in feature selection.

  • We propose a new feature selection method based on mutual information instead of the l1 regularization, which not only realizes minimum redundancy among features but also utilizes the structural information of the features. The new method has a closed-form solution and avoids the time-consuming iterative optimization of the l1 regularization.

  • Through experiments on several public data sets, we demonstrate the superiority of the approach over several mainstream l1 regularization algorithms.

  • We develop a parallel computing realization of the proposed model. The training time of our method is only 50 percent of that of the traditional methods.

The remainder of this paper is organized as follows. Section 2 describes the related work. In Section 3, we discuss the role of the l1 regularization sparsity in feature selection. We present the novel feature selection model in Section 4. The experimental results are summarized in Section 5. We draw the conclusions in Section 6.

Section snippets

Related work

The proposed work is inspired by a detailed analysis of the l1 regularization based feature selection. In this section, we first introduce several classical feature selection algorithms based on the l1 regularization, which are used in experiments as references. We also introduce the work on variable dependency analysis with mutual information.

The role of l1 regularization in feature selection

To understand the role of the l1 regularization in feature selection, we divide the discussion into three subsections. Firstly, we introduce the sparsity obtained with the l1 regularization through the general solution process of a loss function with the l1 norm. Secondly, we illustrate that the features not selected by the basic sparsity routine also contain abundant information for discriminant tasks. Finally, we provide a theoretical explanation about the role of sparsity in feature
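For reference, the general loss function with the l1 norm referred to in this snippet commonly takes the Lasso form (a standard example, not necessarily the paper's exact notation):

    \min_{w \in \mathbb{R}^{d}} \; \frac{1}{2}\, \| y - Xw \|_2^2 + \lambda\, \| w \|_1

where the non-zero entries of the learned w indicate the selected features and λ controls the degree of sparsity.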

LRMI method

It is known that the computation of the l1 regularization typically requires solving either an NP-hard problem or an alternative problem involving costly iterative optimization [17]. In Section 3, we reveal that the l1 regularization has the property of keeping minimum redundancy among the selected features. There are many ways to achieve redundancy minimization. We propose a novel feature selection model, LRMI, based on mutual information. LRMI uses the mutual information to minimize the
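The snippet above is truncated; a plausible form of such an objective, written here only as an assumption consistent with the description (the paper's actual equation is not shown in this preview), is

    \min_{w} \; \| y - Xw \|_2^2 + \alpha\, w^{\top} L_{\mathrm{MI}}\, w + \beta\, \| w \|_2^2, \qquad L_{\mathrm{MI}} = D - M

where M holds pairwise mutual information between features, D is its degree (row-sum) matrix, the quadratic term penalizes large weights on mutually redundant features, and the l2 term keeps the solution in closed form.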

Experimental evaluation

The performance of LRMI is compared with several other regularization based embedded methods for feature selection, i.e. Lasso (l1 regularization) [28], Ridge (l2 regularization) [31], Fused Lasso [30], Elastic Net [18] and Group Lasso [32]. As reviewed in Section 2.2.1, these methods are chosen for two reasons. First, they are all suitable for embedded feature selection based on the l1 or l2 regularization. Second, they are reported to achieve good performance in the literature. After all, the
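The sketch below (hyperparameters and data are illustrative; Fused Lasso and Group Lasso are not part of scikit-learn and would need separate implementations) shows how several of the baselines named above might be run for comparison:

    # Hedged baseline comparison with scikit-learn regularized linear models.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge, ElasticNet
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(300, 100)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(300)       # toy regression target

    baselines = {
        "Lasso (l1)":  Lasso(alpha=0.05),
        "Ridge (l2)":  Ridge(alpha=1.0),
        "Elastic Net": ElasticNet(alpha=0.05, l1_ratio=0.5),
    }
    for name, model in baselines.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())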

Conclusions

In this paper, we illustrate the role of the l1 regularization in feature selection from the perspective of information theory. We prove that the l1 regularization has the effect of minimizing the redundancy of the selected feature subset. Since l1-regularization-based feature selection is very time-consuming and neglects the group effect among features, we propose a novel feature selection method, LRMI. We replace the l1 regularization with the mutual information to measure the correlation or

Acknowledgments

The work is funded by the National Natural Science Foundation of China (Nos. 61773375, 61170155).


References (53)

  • H. Cheng et al., Sparse representation and learning in visual recognition: theory and applications, Signal Process. (2013)
  • N.X. Vinh et al., Can high-order dependencies improve mutual information based feature selection?, Pattern Recognit. (2016)
  • A.-C. Haury et al., The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS ONE (2011)
  • C.C. Aggarwal, Feature selection for classification: a review, Data Classification (2014)
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • I. Guyon, Pattern classification, Pattern Anal. Appl. (1998)
  • Q. Gu, Z. Li, J. Han, Generalized Fisher score for feature selection (2012), arXiv preprint...
  • J. Wright et al., Sparse representation for computer vision and pattern recognition, Proc. IEEE (2010)
  • M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing (2010)
  • A. Destrero et al., A sparsity-enforcing method for learning face features, IEEE Trans. Image Process. (2009)
  • A. Destrero et al., A regularized framework for feature selection in face detection and authentication, Int. J. Comput. Vis. (2009)
  • Y. Fang et al., Multi-instance feature learning based on sparse representation for facial expression recognition, Proceedings of the International Conference on Multimedia Modeling (2015)
  • F. Nie et al., Efficient and robust feature selection via joint ℓ2,1-norms minimization, Proceedings of the 23rd International Conference on Neural Information Processing Systems (2010)
  • X. Hang et al., Sparse representation for classification of tumors using gene expression data, J. Biomed. Biotechnol. (2009)
  • C. Yu et al., Sparse representation for feature selection in face demographic classification, J. Comput. Inf. Syst. (2014)
  • Q. Yuan et al., A novel automatic grouping algorithm for feature selection, CCF Chinese Conference on Computer Vision (2017)
  • D.L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory (2006)
  • H. Zou et al., Regularization and variable selection via the elastic net, J. R. Stat. Soc. (2005)
  • H.H. Yang et al., Data visualization and feature selection: new algorithms for non-Gaussian data, Adv. Neural Inf. Process. Syst. (2000)
  • H. Peng et al., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. (2005)
  • G. Brown et al., Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. (2012)
  • E.J. Candès, Compressive sampling, Proceedings of the International Congress of Mathematicians (2006)
  • D.L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. (2007)
  • A.M. Bruckstein et al., From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. (2009)
  • A.Y. Ng, Feature selection, L1 vs. L2 regularization, and rotational invariance, Proceedings of the Twenty-First International Conference on Machine Learning (2004)

    Yuchun Fang, Associate Professor. She gained her Ph.D. from the Institute of Automation, Chinese Academy of Sciences in 2003. From 2003 to 2004, she worked as a post-doctoral researcher at the France National Research Institute on Information and Automation (INRIA). Since 2005, she has worked at the School of Computer Engineering and Sciences, Shanghai University. She is a member of IEEE, ACM, and CCF (Chinese Computer Federation). Her current research interests include multimedia, pattern recognition, machine learning and image processing.

    Qiulong Yuan, received B.S. degree in Ningxia University, Ningxia, China, in 2015. He is now a graduate student at Shanghai University, China. His research interests include machine learning and pattern recognition.

    Zhaoxiang Zhang, received the B.S. degree in electronic science and technology from the University of Science and Technology of China, Hefei, China, in 2004 and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009. In 2009, he joined the School of Computer Science and Engineering, Beihang University, Beijing, as an Assistant Professor, from 2009 to 2011, an Associate Professor, from 2012 to 2015, and the Vice-Director of the Department of Computer Application Technology, from 2014 to 2015. In 2015, he returned to the Institute of Automation, Chinese Academy of Sciences, as a Full Professor. His current research interests include computer vision, pattern recognition, machine learning, and brain-inspired neural networks and brain-inspired learning. Dr. Zhang is an Associate Editor or a Guest Editor of several international journals, such as Neurocomputing, Pattern Recognition Letters, and IEEE Access.
