Pattern Recognition

Volume 48, Issue 5, May 2015, Pages 1827-1835

Sparse discriminative feature selection

https://doi.org/10.1016/j.patcog.2014.10.021

Highlights

  • The proposed method selects features that can preserve the sparse reconstructive relationship of the data.

  • A greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed formulation.

  • We incorporate discriminative analysis and l2,1-norm minimization into joint feature selection.

Abstract

Since the sparse representation-based classifier (SRC) was developed, it has drawn increasing attention in dimension reduction. In this paper, we introduce an SRC-based measurement criterion into feature selection and propose a novel method called sparse discriminative feature selection. Our objective function seeks a subset of features that minimizes the within-class reconstruction residual and simultaneously maximizes the between-class reconstruction residual in the subspace of selected features. A greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed combinatorial optimization formulation. In particular, our joint selection algorithm adds l2,1-norm minimization to the objective function, which reduces redundancy and learns feature weights simultaneously. A new iterative algorithm is also developed to optimize the proposed objective function. Experiments on benchmark data sets demonstrate the effectiveness of our feature selection method.

Introduction

In many areas, such as computer vision, pattern recognition and gene expression array analysis, data are characterized by high dimensional feature vectors. In practice, only a small subset of features is truly important and discriminative. Consequently, dimensionality reduction is necessary, and it can be mainly categorized into feature extraction and feature selection. Feature extraction transforms features from a high dimensional space into a low dimensional space, while feature selection chooses a subset of features by eliminating redundant features based on certain criteria. Compared with feature extraction, which creates new representations of features, feature selection keeps their original physical meanings and thus facilitates the interpretation of results in data analysis.

Feature selection mainly focuses on search strategies and measurement criteria. According to search strategies, feature selection methods can be classified into three main families: filter, wrapper, and embedded methods. The filter methods [1], [2], [3], [4], [5] evaluate the importance of features by using the statistical properties of data without considering any knowledge of classifiers. The wrapper methods [6], [7] evaluate feature subsets tightly coupled with a specific learning algorithm that will ultimately be employed. Embedded methods [8], [9] evaluate the goodness of selected features in the process of model construction.

Feature selection is essentially a combinatorial optimization problem; in particular, finding a globally optimal solution is NP-hard. To address this issue, traditional feature selection methods individually evaluate each feature by a weight characterizing certain statistical or geometric properties of the data points, rank the features accordingly, and then select them one by one. However, they cannot provide any guarantee of global optimality. Besides, they are quite likely to neglect the interaction and dependency between different features. Therefore, researchers have introduced sparsity regularization into joint feature selection, which takes the feature correlation into account [10], [11], [12]. Nie [10] proposes an l2,1-norm regularization model for sparse feature selection. Cai [13] incorporates spectral regression and l1-norm regularization, and proposes a two-step approach. Yang [14] combines manifold learning and l2,1-norm minimization in joint feature selection.
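To make the role of the l2,1-norm concrete, here is a minimal sketch (in Python with NumPy; illustrative code, not taken from any of the cited papers) of how the l2,1-norm of a feature-weight matrix is computed and how its row norms induce a feature ranking; the toy matrix W and function names are our own.

    import numpy as np

    def l21_norm(W):
        # l2,1-norm: sum over rows of each row's l2-norm, ||W||_{2,1} = sum_i ||w_i||_2
        return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

    def rank_features_by_row_norm(W):
        # Rows (features) with larger l2-norms are deemed more important;
        # rows driven to (near) zero by the regularizer are discarded.
        row_norms = np.sqrt(np.sum(W ** 2, axis=1))
        return np.argsort(-row_norms)

    # Toy example: 5 features, 3 classes
    W = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.3, 0.4, 0.2],
                  [0.0, 0.1, 0.0],
                  [0.7, 0.6, 0.8]])
    print(l21_norm(W))                      # approx. 2.76
    print(rank_features_by_row_norm(W))     # [4 0 2 3 1]

Minimizing this norm over the weight matrix encourages whole rows to vanish, which is what makes it suitable for joint (rather than one-by-one) feature selection.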

Recently, the theory of sparse representation has been successfully applied to compressed sensing [15], [16], image analysis [17], [18], [19], and dimension reduction [20], [21], [22], [23], [24]. All these sparse representation based dimension reduction methods borrow the idea of sparse representation classification (SRC) [25], and they are essentially feature extraction methods. Unfortunately, existing feature selection methods so far have no direct connection to SRC. Thus, our goal is to introduce SRC into the measurement criterion for feature selection. Meanwhile, unlike traditional feature searching strategies that select features one by one from the whole feature set, we select the best feature subset in batch mode. These reasons motivate us to develop an SRC-based joint feature selection method.

In this paper, we introduce an SRC-based measurement criterion into a novel supervised feature selection method, coined sparse discriminative feature selection. Our objective function selects features by minimizing the within-class reconstruction residual and simultaneously maximizing the between-class reconstruction residual in the selected feature subset. Its appealing characteristics are summarized as follows:

  • For measurement criterion: The proposed method selects features that simultaneously preserve the sparse reconstructive relationship of the data and its discriminative information. In practice, the importance of a feature (or feature subset) is evaluated by the ratio of the between-class reconstruction residual to the within-class reconstruction residual in the subset of selected features (a minimal illustration of this criterion is sketched after this list). Sharing the advantages of sparse representation, which reflects intrinsic geometric properties of the data, the selected features naturally carry discriminative information.

  • For search strategies: To provide more choices between effectiveness and efficiency in practical applications, a greedy algorithm and a joint selection algorithm are devised to efficiently solve the proposed combinatorial optimization formulation. We incorporate discriminative analysis and l2,1-norm minimization into this joint feature selection, which simultaneously exploits feature correlations and selects the most discriminative features from the whole feature space.

  • For optimization methods: For the two search strategies, we offer two solutions. Like traditional feature selection algorithms, our greedy algorithm evaluates the importance of each feature individually. Our joint selection algorithm efficiently solves the corresponding optimization problem via two sub-problems, i.e., a generalized eigenvector problem and a norm regularization problem, which can be solved by the inexact Augmented Lagrange Multiplier (ALM) method [26] with theoretically provable convergence.
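As a minimal illustration of the measurement criterion in the first bullet, the sketch below scores a hypothetical candidate feature subset by the ratio of between-class to within-class sparse-reconstruction residuals, using a leave-one-out sparse coding over the training samples. scikit-learn's Lasso serves as a stand-in sparse coder; the function name, the parameter lam, and the exact residual definitions are our illustrative choices, not the paper's formulation.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_residual_ratio(X, y, subset, lam=0.01):
        # Score a candidate feature subset: a larger between/within residual
        # ratio means the subset is more discriminative under SRC.
        Xs = X[:, subset]                        # n_samples x |subset|
        n = Xs.shape[0]
        within, between = 0.0, 0.0
        for i in range(n):
            others = np.delete(np.arange(n), i)  # leave-one-out dictionary
            D = Xs[others].T                     # |subset| x (n-1), samples as columns
            coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
            coder.fit(D, Xs[i])
            a = coder.coef_
            same = (y[others] == y[i])
            # reconstruct x_i from same-class and from other-class samples
            within += np.linalg.norm(Xs[i] - D[:, same] @ a[same]) ** 2
            between += np.linalg.norm(Xs[i] - D[:, ~same] @ a[~same]) ** 2
        return between / max(within, 1e-12)

A greedy strategy would then add, at each step, the feature whose inclusion most increases this score, whereas the joint algorithm optimizes over all features at once under the l2,1-norm penalty.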

The remainder of this paper is organized as follows. In Section 2, we briefly review SRC steered discriminative projection. We present our sparse discriminative feature selection in Section 3. The experiments on benchmark data sets are demonstrated in Section 4. Finally, we draw a conclusion in Section 5.

Section snippets

SRC steered discriminative projection (SRC-DP)

Essentially, SRC represents a given test sample as a linear combination of all training samples. A naturally good solution for the representation coefficients is sparse, and its nonzero components are expected to concentrate on the training samples with the same class label as the test sample.
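For reference, a minimal sketch of the standard SRC decision rule (the generic classifier, not SRC-DP itself) is given below; Lasso again serves as the sparse coder and all names are illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_predict(X_train, y_train, x_test, lam=0.01):
        # Code the test sample over all training samples, then assign it to
        # the class whose samples reconstruct it with the smallest residual.
        D = X_train.T                            # features x samples dictionary
        coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        coder.fit(D, x_test)
        a = coder.coef_
        residuals = {}
        for c in np.unique(y_train):
            mask = (y_train == c)
            residuals[c] = np.linalg.norm(x_test - D[:, mask] @ a[mask])
        return min(residuals, key=residuals.get)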

Although SRC claims to be insensitive to feature extraction, an effective and efficient projection matrix can lead to a higher classification rate at a lower dimensionality [21], [22]. Similar to [21], [27], …

Basic idea and algorithm

In this section, we will present the objective function of the proposed sparse discriminative feature selection, and its optimization algorithms.

We define P ∈ R^(N×N) as the feature selection matrix, which satisfies: (1) P has only '0' or '1' as its elements; (2) each row (or column) of P has no more than one '1'; (3) in order to indicate that d features are selected, only d rows (or columns) contain exactly one '1', and the remaining (N − d) rows are just zero vectors. Consequently, for a sample x, x̃ = P^T x is its …
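To make the selection operator concrete, the toy sketch below builds one P that satisfies the three conditions above (the diagonal instance, with 1s at the selected features) and applies x̃ = P^T x; the helper name and numbers are illustrative.

    import numpy as np

    def selection_matrix(selected, N):
        # Diagonal 0/1 matrix with a 1 at (j, j) for each selected feature j.
        P = np.zeros((N, N))
        P[selected, selected] = 1.0
        return P

    x = np.array([0.5, -1.2, 3.0, 0.7])
    P = selection_matrix([0, 2], N=4)
    print(P.T @ x)    # [0.5 0.  3.  0. ] - selected coordinates kept, others zeroed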

Experiments and analysis

In this section, we run experiments on 6 datasets as shown in Table 1: Ionosphere, Spambase, Sonar, USPS, Extended YaleB and CMU PIE. The first three real-world data sets are available from the UCI machine learning benchmark repository; USPS is a handwritten digit dataset; Extended YaleB and CMU PIE are two standard face databases used in [31]. The Extended YaleB database contains 16,128 face images of 38 human subjects under 9 poses and 64 illumination conditions …

Conclusions and future work

By introducing an SRC-based measurement criterion into feature selection, we proposed sparse discriminative feature selection, and then designed a greedy algorithm and a joint selection algorithm to efficiently solve the proposed objective function. Experimental results showed that the proposed joint selection algorithm (JSDFS) consistently outperformed competing methods in classification. In our future work, we will improve SRC and then apply this novel SRC-based measurement criterion to feature selection.

Conflict of interest

None declared.

Acknowledgements

This work is supported by the National Science Foundation of China (Grant no. 61202134), National Science Fund for Distinguished Young Scholars (Grant no. 61125305), China Postdoctoral Science Foundation (Grant no. AD41431), and the Postdoctoral Science Foundation of Jiangsu Province (Grant no. AD41358).

References (38)

  • M. Masaeli, G. Fung, J.G. Dy, From transformation-based dimensionality reduction to feature selection, in: …
  • H. Liu, X. Wu, S. Zhang, Feature selection using hierarchical feature clustering, in: Conference on Information and …
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • A. Rakotomamonjy, Variable selection using SVM based criteria, J. Mach. Learn. Res. (2003)
  • V. Vapnik, Statistical Learning Theory (1998)
  • J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, in: Advances in Neural Information …
  • F.P. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, in: …
  • Z.G. Ma et al., Web image annotation via subspace-sparsity collaborated feature selection, IEEE Trans. Multimed. (2012)
  • Q.Q. Gu, Z.H. Li, J.W. Han, Joint feature selection and subspace learning, in: International Joint Conference on …

Hui Yan received her B.S. degree and Ph.D. degree from the School of Computer Science and Technology, Nanjing University of Science and Technology (NUST), Nanjing, China, in 2005 and 2011, respectively. In 2009, she was a visiting student at the Department of Electrical and Computer Engineering at National University of Singapore, Singapore. She is currently a lecturer at the School of Computer Science and Engineering, NUST. Her research interests include pattern recognition, computer vision and machine learning.

Jian Yang received his B.S. degree in mathematics from Xuzhou Normal University, Xuzhou, China, in 1995, his M.S. degree in applied mathematics from Changsha Railway University, Changsha, China, in 1998, and his Ph.D. degree in pattern recognition and intelligence systems from Nanjing University of Science and Technology (NUST), Nanjing, China, in 2002. He was a Post-Doctoral Researcher at the University of Zaragoza, Spain, in 2003. From 2004 to 2006, he was a Post-Doctoral Fellow at the Biometrics Centre of Hong Kong Polytechnic University, Hong Kong. From 2006 to 2007, he was a Post-Doctoral Fellow at the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA. He is currently a Professor at the School of Computer Science and Technology, NUST. He has authored more than 80 academic papers in pattern recognition and computer vision. His journal papers have been cited more than 1800 times in the ISI Web of Science, and 3000 times in Google Scholar. His current research interests include pattern recognition, computer vision and machine learning.
