Pattern Recognition

Volume 48, Issue 5, May 2015, Pages 1827-1835

Sparse discriminative feature selection

https://doi.org/10.1016/j.patcog.2014.10.021

Highlights

  • The proposed method selects features that can preserve the sparse reconstructive relationship of the data.

  • A greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed formulation.

  • We incorporate discriminative analysis and l2,1-norm minimization into joint feature selection.

Abstract

Since the sparse representation-based classifier (SRC) was developed, it has drawn increasing attention in dimension reduction. In this paper, we introduce an SRC-based measurement criterion into feature selection and propose a novel method called sparse discriminative feature selection. Our objective function seeks a subset of features that minimizes the within-class reconstruction residual and simultaneously maximizes the between-class reconstruction residual in the subspace of selected features. A greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed combinatorial optimization formulation. In particular, our joint selection algorithm adds l2,1-norm minimization to the objective function, which reduces redundancy and learns feature weights simultaneously. A new iterative algorithm is also developed to optimize the proposed objective function. Experiments on benchmark data sets demonstrate the effectiveness of our feature selection method.

Introduction

In many areas, such as computer vision, pattern recognition and gene expression array analysis, data are characterized by high dimensional feature vectors. In practice, only a small subset of features is truly important and discriminative. Consequently, dimensionality reduction is necessary, and it can be mainly categorized into feature extraction and feature selection. Feature extraction transforms features from a high dimensional space into a low dimensional space, while feature selection chooses a subset of features by eliminating redundant features based on certain criteria. Compared with feature extraction, which creates new representations of features, feature selection keeps their original physical meanings and thus facilitates the interpretation of results in data analysis.

Feature selection mainly focuses on search strategies and measurement criteria. According to search strategies, feature selection methods can be classified into three main families: filter, wrapper, and embedded methods. The filter methods [1], [2], [3], [4], [5] evaluate the importance of features by using the statistical properties of data without considering any knowledge of classifiers. The wrapper methods [6], [7] evaluate feature subsets tightly coupled with a specific learning algorithm that will ultimately be employed. Embedded methods [8], [9] evaluate the goodness of selected features in the process of model construction.

Feature selection is essentially a combinatorial optimization problem; in particular, finding a globally optimal solution is NP-hard. To address this issue, traditional feature selection methods individually evaluate each feature by a weight characterizing certain statistical or geometric properties of the data points, rank the features accordingly, and then select them one by one. However, they cannot provide any guarantee of global optimality. Besides, they are quite likely to neglect the interaction and dependency between different features. Therefore, researchers have introduced sparsity regularization into joint feature selection, which takes the feature correlation into account [10], [11], [12]. Nie [10] proposes an l2,1-norm regularization model for sparse feature selection. Cai [13] incorporates spectral regression and l1-norm regularization, and proposes a two-step approach. Yang [14] combines manifold learning and l2,1-norm minimization in joint feature selection.
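To make the role of the l2,1-norm concrete, here is a minimal sketch (in Python with NumPy; illustrative code, not taken from any of the cited papers) of how the l2,1-norm of a feature-weight matrix is computed and how its row norms induce a feature ranking; the toy matrix W and function names are our own.

    import numpy as np

    def l21_norm(W):
        # l2,1-norm: sum over rows of each row's l2-norm, ||W||_{2,1} = sum_i ||w_i||_2
        return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

    def rank_features_by_row_norm(W):
        # Rows (features) with larger l2-norms are deemed more important;
        # rows driven to (near) zero by the regularizer are discarded.
        row_norms = np.sqrt(np.sum(W ** 2, axis=1))
        return np.argsort(-row_norms)

    # Toy example: 5 features, 3 classes
    W = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.3, 0.4, 0.2],
                  [0.0, 0.1, 0.0],
                  [0.7, 0.6, 0.8]])
    print(l21_norm(W))                      # approx. 2.76
    print(rank_features_by_row_norm(W))     # [4 0 2 3 1]

Minimizing this norm over the weight matrix encourages whole rows to vanish, which is what makes it suitable for joint (rather than one-by-one) feature selection.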

Recently, the theory of sparse representation has been successfully applied to compressed sensing [15], [16], image analysis [17], [18], [19], and dimension reduction [20], [21], [22], [23], [24]. All these sparse representation based dimension reduction methods borrow the idea of sparse representation classification (SRC) [25], and they are essentially feature extraction methods. Unfortunately, existing feature selection methods so far have no direct connection to SRC. Thus, our goal is to introduce SRC into the measurement criterion for feature selection. Meanwhile, unlike traditional feature searching strategies that select features one by one from the whole feature set, we select the best feature subset in batch mode. These reasons motivate us to develop an SRC-based joint feature selection method.

In this paper, we introduce an SRC-based measurement criterion into a novel supervised feature selection method, coined sparse discriminative feature selection. Our objective function selects features by minimizing the within-class reconstruction residual and simultaneously maximizing the between-class reconstruction residual in the selected feature subset. Its appealing characteristics are summarized as follows:

  • For measurement criterion: The proposed method selects features that simultaneously preserve the sparse reconstructive relationship of the data and its discriminative information. In practice, the importance of a feature (or feature subset) is evaluated by the ratio of the between-class reconstruction residual to the within-class reconstruction residual in the subset of selected features (a minimal illustration of this criterion is sketched after this list). Sharing the advantages of sparse representation, which reflects intrinsic geometric properties of the data, the selected features naturally carry discriminative information.

  • For search strategies: To provide more choices between effectiveness and efficiency in practical applications, a greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed combinatorial optimization formulation. We incorporate discriminative analysis and l2,1-norm minimization into this joint feature selection, which simultaneously exploits feature correlations and selects the most discriminative features from the whole feature space.

  • For optimization methods: For the two search strategies, we offer two solutions. Like traditional feature selection algorithms, our greedy algorithm evaluates the importance of each feature individually. Our joint selection algorithm efficiently solves the corresponding optimization problem via two sub-problems, i.e., a generalized eigenvector problem and a norm regularization problem, which can be solved by the inexact Augmented Lagrange Multiplier (ALM) method [26] with theoretically provable convergence.
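As a minimal illustration of the measurement criterion in the first bullet, the sketch below scores a hypothetical candidate feature subset by the ratio of between-class to within-class sparse-reconstruction residuals, using a leave-one-out sparse coding over the training samples. scikit-learn's Lasso serves as a stand-in sparse coder; the function name, the parameter lam, and the exact residual definitions are our illustrative choices, not the paper's formulation.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_residual_ratio(X, y, subset, lam=0.01):
        # Score a candidate feature subset: a larger between/within residual
        # ratio means the subset is more discriminative under SRC.
        Xs = X[:, subset]                        # n_samples x |subset|
        n = Xs.shape[0]
        within, between = 0.0, 0.0
        for i in range(n):
            others = np.delete(np.arange(n), i)  # leave-one-out dictionary
            D = Xs[others].T                     # |subset| x (n-1), samples as columns
            coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
            coder.fit(D, Xs[i])
            a = coder.coef_
            same = (y[others] == y[i])
            # reconstruct x_i from same-class and from other-class samples
            within += np.linalg.norm(Xs[i] - D[:, same] @ a[same]) ** 2
            between += np.linalg.norm(Xs[i] - D[:, ~same] @ a[~same]) ** 2
        return between / max(within, 1e-12)

A greedy strategy would then add, at each step, the feature whose inclusion most increases this score, whereas the joint algorithm optimizes over all features at once under the l2,1-norm penalty.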

The remainder of this paper is organized as follows. In Section 2, we briefly review SRC steered discriminative projection. We present our sparse discriminative feature selection in Section 3. The experiments on benchmark data sets are demonstrated in Section 4. Finally, we draw a conclusion in Section 5.

Section snippets

SRC steered discriminative projection (SRC-DP)

Essentially, SRC represents a given test sample as a linear combination of all training samples. A naturally good solution for the representation coefficients is sparse, and its nonzero components are expected to concentrate on the training samples with the same class label as the test sample.
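For reference, a minimal sketch of the standard SRC decision rule (the generic classifier, not SRC-DP itself) is given below; Lasso again serves as the sparse coder and all names are illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_predict(X_train, y_train, x_test, lam=0.01):
        # Code the test sample over all training samples, then assign it to
        # the class whose samples reconstruct it with the smallest residual.
        D = X_train.T                            # features x samples dictionary
        coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        coder.fit(D, x_test)
        a = coder.coef_
        residuals = {}
        for c in np.unique(y_train):
            mask = (y_train == c)
            residuals[c] = np.linalg.norm(x_test - D[:, mask] @ a[mask])
        return min(residuals, key=residuals.get)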

Although SRC claims to be insensitive to feature extraction, an effective and efficient projection matrix can lead to a higher classification rate at a lower dimensionality [21], [22]. Similar to [21], [27], …

Basic idea and algorithm

In this section, we will present the objective function of the proposed sparse discriminative feature selection, and its optimization algorithms.

We define P ∈ R^(N×N) as the feature selection matrix, which satisfies: (1) P has only '0' or '1' as its elements; (2) each row (or column) of P has no more than one '1'; (3) in order to indicate that d features are selected, only d rows (or columns) contain exactly one '1', and the remaining (N − d) rows are just zero vectors. Consequently, for a sample x, x̃ = P^T x is its …
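To make the selection operator concrete, the toy sketch below builds one P that satisfies the three conditions above (the diagonal instance, with 1s at the selected features) and applies x̃ = P^T x; the helper name and numbers are illustrative.

    import numpy as np

    def selection_matrix(selected, N):
        # Diagonal 0/1 matrix with a 1 at (j, j) for each selected feature j.
        P = np.zeros((N, N))
        P[selected, selected] = 1.0
        return P

    x = np.array([0.5, -1.2, 3.0, 0.7])
    P = selection_matrix([0, 2], N=4)
    print(P.T @ x)    # [0.5 0.  3.  0. ] - selected coordinates kept, others zeroed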

Experiments and analysis

In this section, we run experiments on 6 datasets as shown in Table 1: Ionosphere, Spambase, Sonar, USPS, Extended YaleB and CMU PIE. The first three real-world data sets are available from the UCI machine learning benchmark repository; USPS is a handwritten digit dataset; Extended YaleB and CMU PIE are two standard face databases used in [31]. The Extended YaleB database contains 16,128 face images of 38 human subjects under 9 poses and 64 illumination conditions …

Conclusions and future work

By introducing an SRC-based measurement criterion into feature selection, we proposed sparse discriminative feature selection, and then designed a greedy algorithm and a joint selection algorithm to efficiently solve the proposed objective function. Experimental results showed that the proposed joint selection algorithm (JSDFS) consistently outperformed competing methods in classification. In our future work, we will improve SRC and then apply this novel SRC-based measurement criterion to feature selection.

Conflict of interest

None declared.

Acknowledgements

This work is supported by the National Science Foundation of China (Grant no. 61202134), National Science Fund for Distinguished Young Scholars (Grant no. 61125305), China Postdoctoral Science Foundation (Grant no. AD41431), and the Postdoctoral Science Foundation of Jiangsu Province (Grant no. AD41358).

References (38)

  • M. Masaeli, G. Fung, J.G. Dy, From transformation-based dimensionality reduction to feature selection, in: …
  • H. Liu, X. Wu, S. Zhang, Feature selection using hierarchical feature clustering, in: Conference on Information and …
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • A. Rakotomamonjy, Variable selection using SVM based criteria, J. Mach. Learn. Res. (2003)
  • V. Vapnik, Statistical Learning Theory (1998)
  • J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, in: Advances in Neural Information …
  • F.P. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, in: …
  • Z.G. Ma et al., Web image annotation via subspace-sparsity collaborated feature selection, IEEE Trans. Multimed. (2012)
  • Q.Q. Gu, Z.H. Li, J.W. Han, Joint feature selection and subspace learning, in: International Joint Conference on …

Hui Yan received her B.S. degree and Ph.D. degree from the School of Computer Science and Technology, Nanjing University of Science and Technology (NUST), Nanjing, China, in 2005 and 2011, respectively. In 2009, she was a visiting student at the Department of Electrical and Computer Engineering at National University of Singapore, Singapore. She is currently a lecturer at the School of Computer Science and Engineering, NUST. Her research interests include pattern recognition, computer vision and machine learning.

Jian Yang received his B.S. degree in mathematics from Xuzhou Normal University, Xuzhou, China, in 1995, his M.S. degree in applied mathematics from Changsha Railway University, Changsha, China, in 1998, and his Ph.D. degree in pattern recognition and intelligence systems from Nanjing University of Science and Technology (NUST), Nanjing, China, in 2002. He was a Post-Doctoral Researcher at the University of Zaragoza, Spain, in 2003. From 2004 to 2006, he was a Post-Doctoral Fellow at the Biometrics Centre of Hong Kong Polytechnic University, Hong Kong. From 2006 to 2007, he was a Post-Doctoral Fellow at the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA. He is currently a Professor at the School of Computer Science and Technology, NUST. He has authored more than 80 academic papers in pattern recognition and computer vision. His journal papers have been cited more than 1800 times in the ISI Web of Science, and 3000 times in Google Scholar. His current research interests include pattern recognition, computer vision and machine learning.
