Low-rank structure preserving for unsupervised feature selection
Introduction
Data acquisition has become progressively more convenient with the rapid development of computer hardware and the Internet. In many practical areas, such as face recognition, video surveillance, signal processing, and gene microarrays, data are represented by large matrices, so subsequent processing consumes increasing amounts of time and storage [1], [2]. The most popular way to overcome this problem is dimensionality reduction, which not only accelerates algorithm execution but may also improve final classification or clustering accuracy: research has shown that real-world data are generally highly correlated, and their intrinsic dimension is often smaller than that of the original space [3], [4].
One of the primary means of dimensionality reduction is feature selection, which keeps the original meaning of the features and thus maintains their interpretability. According to the availability of label information, feature selection algorithms can be broadly classified as supervised, semi-supervised, and unsupervised. We focus on unsupervised feature selection: although it is more challenging because massive data typically lack label information, it is less costly than supervised scenarios. To date, most unsupervised feature selection methods cover the following aspects: cluster structure learning [5], [6], [7], [8], [9], [10], [11], data reconstruction [12], [13], [14], local similarity preserving [11], [15], [16], and combinations of the former [7], [8], [9], [12], [17], [18], [19], [20].
The goal of feature selection for unsupervised learning is to identify a feature subset that best preserves the intrinsic cluster structure hidden in the data according to a specified clustering criterion [21]. It has been demonstrated that spectral clustering can effectively capture the cluster structure of samples [22]. For example, multi-cluster feature selection (MCFS) [5] uses spectral analysis to detect the intrinsic dimension or “flat” embedding of the data, then measures the importance of each feature with an l1-norm regularized sparse regression model. Joint embedding learning and sparse regression (JELSR) [6] combines embedding learning with sparse regression to perform feature selection.
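As a hedged illustration of the spectral-analysis step that MCFS-style methods build on, the numpy-only sketch below constructs a kNN affinity graph and takes the smallest nontrivial eigenvectors of the normalized Laplacian as the “flat” embedding; the function name and parameter choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def flat_embedding(X, n_dims=2, k=5):
    """Spectral ('flat') embedding used by MCFS-style methods.

    X: (n_samples, n_features) data matrix.
    Returns the n_dims nontrivial eigenvectors of the normalized
    graph Laplacian built from a k-nearest-neighbor affinity graph.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Symmetric kNN connectivity graph (self-loops excluded).
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(sq[i])[1:k + 1]
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(n) - Dinv @ W @ Dinv
    vals, vecs = np.linalg.eigh(L)
    # Skip the trivial smallest eigenvector, keep the next n_dims.
    return vecs[:, 1:n_dims + 1]
```

MCFS then regresses each embedding dimension on the features with an l1 penalty and scores feature j by the largest absolute coefficient it receives across the embedding dimensions.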
The cluster structure can also be revealed by predicting cluster indicators, which can be regarded as approximations of class labels. The cluster indicator matrix is usually generated by non-negative matrix factorization or the graph Laplacian. Typical approaches include non-negative discriminative feature selection (NDFS) [8], robust unsupervised feature selection (RUFS) [9], clustering-guided sparse structural learning (CGSSL) [7], and embedded unsupervised feature selection (EUFS) [10]. Nevertheless, these methods require the number of categories to be known a priori in order to construct the pseudo-label indicator matrix.
Another popular way to perform feature selection is to minimize the reconstruction error. The importance of features is reflected by a projection matrix in subspace learning or a coefficient matrix in self-representation. Typical subspace learning methods based on data reconstruction include matrix factorization feature selection (MFFS) [12] and its extensions, e.g., discriminative sparse subspace learning (DSSL) [19], global and local structure preserving sparse subspace learning (GLoSS) [20], and coupled dictionary learning for unsupervised feature selection (CDL-FS) [13]. CDL-FS introduces a coupled analysis-synthesis dictionary learning framework based on data reconstruction, in which the synthesis dictionary reconstructs samples and the analysis dictionary encodes them. Regularized self-representation (RSR) [14] learns the correlation between features through a sparsely regularized self-representation matrix, then exploits the most representative features to reconstruct the other features.
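To make the self-representation idea concrete, here is a hedged numpy sketch in the spirit of RSR: each feature is represented as a linear combination of all features, X ≈ XW, and feature i is ranked by the l2 norm of row i of W. For simplicity this sketch uses a ridge (Frobenius) penalty with a closed-form solution instead of RSR's l2,1 penalty; the function name and the parameter lam are assumptions.

```python
import numpy as np

def self_representation_scores(X, lam=1e-2):
    """Score features by their role in reconstructing other features.

    Solves min_W ||X - X W||_F^2 + lam ||W||_F^2 in closed form
    (a ridge surrogate for RSR's l2,1-regularized problem), then
    ranks feature i by the l2 norm of row i of W: features with
    large row norms are heavily used to represent the others.
    """
    d = X.shape[1]
    G = X.T @ X
    W = np.linalg.solve(G + lam * np.eye(d), G)
    return np.linalg.norm(W, axis=1)
```

Usage would be `np.argsort(-self_representation_scores(X))[:k]` to pick the k most representative features; RSR's actual l2,1 penalty additionally drives whole rows of W to zero.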
In the absence of label information, similarity is a pivotal factor and should be considered in feature selection. A number of algorithms assess a feature's capability to preserve sample similarity, which can be inferred from predefined similarity measures from local and global perspectives, separately or simultaneously. For instance, the Laplacian score [15] and similarity preserving feature selection (SPFS) [11] rank features based on an affinity matrix (with its corresponding degree and Laplacian matrices) under different evaluation criteria. Graph embedding is also used to preserve local similarity, and can be combined with the other two approaches (i.e., cluster structure learning and data reconstruction) as a regularization term [6], [7], [9], [19], [20].
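As a concrete example of similarity-preserving ranking, this hedged sketch computes the classic Laplacian score from a caller-supplied affinity matrix S, following the standard formula L_r = f̃ᵀLf̃ / f̃ᵀDf̃ (smaller is better); graph construction is left to the caller and the helper name is illustrative.

```python
import numpy as np

def laplacian_score(X, S):
    """Laplacian score of each feature given affinity matrix S.

    X: (n_samples, n_features); S: (n_samples, n_samples) symmetric
    affinities. A lower score means the feature varies little across
    strongly connected samples, i.e., it preserves local similarity.
    """
    d = S.sum(axis=1)                  # degree vector
    L = np.diag(d) - S                 # unnormalized graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # Center f with respect to the degree distribution.
        f_t = f - (f @ d) / d.sum()
        num = f_t @ L @ f_t            # smoothness over the graph
        den = f_t @ (d * f_t)          # degree-weighted variance
        scores[r] = num / max(den, 1e-12)
    return scores
```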
The aforementioned algorithms reveal that the accuracy of the cluster indicator matrix or the affinity matrix significantly influences the final result of feature selection, especially in embedding methods where all variables are learned simultaneously. However, real data usually contain noise and outliers, so using the original data to estimate the indicator matrix or to construct the affinity matrix is unreliable. Recent work, e.g., structured optimal graph feature selection (SOGFS) [23] and feature selection with adaptive structure learning (FSASL) [17], provides feasible solutions to this problem. SOGFS adaptively learns the local manifold structure so that the affinity matrix becomes more accurate. FSASL uses the selected features to preserve the global and sparse reconstruction structure via a row-sparse transformation matrix. Nevertheless, the sample distribution cannot be well maintained in this reconstruction process, since reconstruction may depend excessively on a few samples when l1-minimization is imposed on the coefficient matrix [24]. It therefore becomes necessary to find a more suitable and accurate metric that captures the real structure of the samples.
Similar to cluster indicators, representation coefficients also reflect the data distribution [13]. To this end, we introduce low-rank representation coefficients to characterize the global structure of samples represented by the selected features. From the perspective of samples, low-rank representation (LRR) can learn the correct cluster structure when samples are noise-free and drawn from multiple subspaces, by imposing a low-rank constraint on the representation coefficient matrix. Unfortunately, uninformative or misleading features in high-dimensional data adversely affect the estimation of the coefficient matrix, which can also be regarded as an affinity matrix. To overcome this problem, we construct the representation coefficient matrix in a low-dimensional embedding space. By using an l2,1-regularized projection matrix, redundant and irrelevant features are removed; each sample therefore has a more compact representation and can be assigned to its true class through a more accurate affinity matrix. Consequently, the intrinsic structure of the data is preserved and can be used to identify representative features. From the perspective of features, the l2,1-regularized projection matrix weakens the similarity between the selected features and reduces noise, so the most representative features, which can be used to reconstruct the other features, are selected. Conversely, if the selected features are not sufficiently discriminative, the correlation coefficient matrix of the samples will not satisfy the low-rank constraint and the features cannot be completely reconstructed. This analysis shows that sample structure preserving, feature reconstruction, and feature selection are complementary.
It is worth pointing out that our proposed method accounts for l2,1-norm regularized reconstruction error minimization as GLoSS [20] does, but differs from GLoSS's focus on the intrinsic geometric structure of the manifold, which exhibits local stability in the embedding. That approach relies on the smoothness (manifold) assumption: the underlying manifold is sufficiently smooth that it can be well approximated by connecting the sample points with a neighborhood graph. Our model instead rests on the practical hypothesis that the observed samples are drawn from a mixture of several low-dimensional subspaces, so the data matrix stacked from all vectorized observations should be approximately low-rank. Hence the selected features effectively preserve the similarity of samples from the same subspace and perform well in classification or clustering tasks.
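The subspace hypothesis behind LRR can be illustrated concretely: for clean samples drawn from independent subspaces, the minimizer of the nuclear norm ||Z||* subject to X = XZ is the shape interaction matrix Z = VVᵀ from the skinny SVD of X, and Z is block-diagonal, so it acts as an affinity matrix separating the subspaces. The hedged numpy sketch below (samples as columns; helper name and tolerance are assumptions) demonstrates this closed-form, noiseless case, not the paper's algorithm.

```python
import numpy as np

def lrr_closed_form(X, tol=1e-10):
    """Closed-form noiseless LRR: Z = V V^T from the skinny SVD of X.

    X: (n_features, n_samples) with samples as columns. For clean data
    from independent subspaces, Z is block-diagonal: Z[i, j] ~ 0 when
    columns i and j come from different subspaces.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s.max()).sum())   # numerical rank of X
    V = Vt[:r].T                         # leading right singular vectors
    return V @ V.T                       # shape interaction matrix
```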
In light of these analyses, we aim to learn a suitable dictionary while simultaneously carrying out subspace clustering and unsupervised feature selection. Using the projection matrix in the dictionary, features are ranked by their ability to reconstruct the original features while the similarity of the samples is preserved. We propose a low-rank structure preserving algorithm for unsupervised feature selection (LRPFS). The procedure is illustrated in Fig. 1.
Our main contributions consist of the following three aspects:
- 1.
To weaken the “similarity” between selected features while enhancing the “similarity” between highly correlated samples, we propose a novel unsupervised feature selection model that exploits group-sparse regularized data reconstruction and low-rank regularized cluster structure preserving simultaneously. The low-rank constraint learns a suitable dictionary that preserves the subspace structures of the data samples, while the sparse constraint removes redundant features, which in turn aids learning the dictionary for data reconstruction. The learned dictionary plays an important role in bridging these two sub-tasks.
- 2.
Based on the learned dictionary, our method represents the samples in a more efficient form, in which the affinity matrix can be constructed from cleaner data instead of the original data. With this refined characterization of the structure, valuable features are selected effectively.
- 3.
We design a practical and simple algorithm to solve the proposed optimization problem. Extensive experiments are conducted on six real-world datasets from various areas, comparing against twelve popular unsupervised feature selection algorithms. The experimental results demonstrate that our method achieves more promising performance across different datasets. In addition, we analyze the sensitivity of the parameters and observe that our method maintains a stable effect over a wide range of parameter values.
The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. The details of LRPFS are introduced in Section 3. We apply the block coordinate descent (BCD) method and the fast iterative shrinkage-thresholding algorithm (FISTA) to solve the optimization problem in Section 4. Section 5 is devoted to experimental results and analyses. Finally, Section 6 concludes the paper.
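Since the optimization in Section 4 relies on FISTA, a minimal numpy sketch of FISTA applied to a generic l1-regularized least-squares subproblem, min_w ½||Aw − b||² + λ||w||₁, is given below; it illustrates the algorithm, not the paper's solver, and the problem shape, step size, and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (entrywise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, n_iter=300):
    """FISTA for min_w 0.5 * ||A w - b||^2 + lam * ||w||_1."""
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    y, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        w_next = soft_threshold(y - grad / L, lam / L)  # proximal gradient step
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = w_next + ((t - 1) / t_next) * (w_next - w)  # momentum extrapolation
        w, t = w_next, t_next
    return w
```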
Related work
Before introducing our proposed method, we review two related topics: robust principal component analysis (RPCA) and low-rank representation (LRR). To facilitate the presentation, the symbols used in this paper are listed in Table 1.
Data from real applications can frequently be characterized by low-rank structure. If all the clean data samples are stacked as column vectors of a matrix, the matrix should be approximately low-rank. Thus, exploring the low-rank subspace structures becomes a
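The low-rank premise above is operationalized in RPCA- and LRR-type solvers through the singular value thresholding (SVT) operator, the proximal map of the nuclear norm. This hedged numpy sketch (function name and threshold are assumptions) shows how soft-thresholding the spectrum yields a lower-rank approximation:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * (nuclear norm).

    Soft-thresholds the singular values of M, shrinking small ones
    to zero and thereby returning a lower-rank matrix.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)     # shrink the spectrum
    return U @ np.diag(s_shrunk) @ Vt
```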
Problem statement
In this section, we propose to learn a dictionary by minimizing the reconstruction error while simultaneously preserving the cluster structure of the data. Based on this dictionary, feature selection is carried out indirectly through the projection matrix. The proposed method can select informative features effectively even when the input features are corrupted. We state our model from two aspects in Sections 3.1 and 3.2, respectively.
Optimization and algorithms
In this section we present an algorithm to solve our LRPFS model. Problem (7) is clearly non-convex jointly in all optimization variables, i.e., W, Z, and H, but it is convex with respect to each of them when the others are fixed. Exploiting this property, we adopt the block coordinate descent (BCD) method of Gauss–Seidel type [35], which separates (7) with respect to W, H, and Z into three independent sub-problems that can be cyclically
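As a hedged toy illustration of Gauss–Seidel-type BCD (applied to a rank-one factorization objective, not the LRPFS objective itself), each block update below is the exact minimizer with the other block fixed, so the objective value is monotonically non-increasing across iterations:

```python
import numpy as np

def bcd_rank1(X, n_iter=50, seed=0):
    """Block coordinate descent for min_{u, v} ||X - u v^T||_F^2.

    A toy Gauss-Seidel scheme: each block update is the exact
    least-squares minimizer with the other block fixed, so the
    objective never increases. Returns u, v, and the per-iteration
    objective values.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    v = rng.standard_normal(n)          # random initialization of one block
    obj = []
    u = np.zeros(m)
    for _ in range(n_iter):
        u = X @ v / (v @ v)             # exact minimizer over u, v fixed
        v = X.T @ u / (u @ u)           # exact minimizer over v, u fixed
        obj.append(np.linalg.norm(X - np.outer(u, v)) ** 2)
    return u, v, np.array(obj)
```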
Experiments
In this section, we demonstrate the applicability of our proposed unsupervised feature selection method, LRPFS, against twelve popular algorithms on six real-world datasets. As our method is a feature selection algorithm guided by multiple-subspace clustering, we follow previous unsupervised feature selection works [7], [20] and evaluate in terms of clustering. Moreover, we give a sensitivity analysis of the parameters in the objective function.
Conclusions
In this work, starting from the nature of the feature selection task, we set out to reduce redundant features and noise while maintaining the inherent distribution of the samples. Thus, we propose an unsupervised feature selection method (LRPFS) combining sparsity and low-rankness. Essentially, l2,1-norm minimization encourages row sparsity for feature selection, but lacks a grouping effect for sample structure preserving. The grouping effect generated by the nuclear norm ensures that the samples
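The two regularizers contrasted above can be computed directly; this small hedged sketch (helper names are illustrative) evaluates the l2,1 norm (sum of row l2 norms, encouraging row sparsity for feature selection) and the nuclear norm (sum of singular values, which produces the grouping effect) of a matrix:

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm: sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def nuclear_norm(W):
    """Nuclear norm: sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()
```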
Acknowledgment
The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Natural Science Fund of China under Grant Nos. U1713208, 61472187, 61602244 and 61772276, the 973 Program No. 2014CB349303, Program for Changjiang Scholars, the Natural Science Foundation of Jiangsu Province under grant no. BK20170857, the fundamental research funds for the central universities no. 30918011321 and
Wei Zheng received the B.S. and the M.S. degrees from the School of Automation Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2004 and 2007, respectively. She is working toward the Ph.D. degree in the School of Computer Science and Engineering, Nanjing University of Science and Technology (NUST), Nanjing, China. Her research interests include pattern recognition, data mining, and machine learning.
References (42)
- et al., Angle 2DPCA: a new formulation for 2DPCA, IEEE Trans. Cybern. (2017)
- et al., Embedded unsupervised feature selection, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
- et al., Subspace learning for unsupervised feature selection via matrix factorization, Pattern Recognit. (2015)
- et al., Unsupervised feature selection by regularized self-representation, Pattern Recognit. (2015)
- et al., Discriminative sparse subspace learning and its application to unsupervised feature selection, ISA Trans. (2016)
- et al., Global and local structure preserving sparse subspace learning: an iterative approach to unsupervised feature selection, Pattern Recognit. (2016)
- et al., On the Schatten norm for matrix based subspace learning and classification, Neurocomputing (2016)
- et al., Feature selection for high-dimensional data, Prog. Artif. Intell. (2016)
- R. Silipo, I. Adae, A. Hart, M. Berthold, Seven techniques for dimensionality reduction, ...
- et al., Feature selection via global redundancy minimization, IEEE Trans. Knowl. Data Eng. (2015)
- Unsupervised feature selection for multi-cluster data, Proceedings of the Sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern.
- Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng.
- Unsupervised feature selection using nonnegative spectral analysis, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
- Robust unsupervised feature selection, Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
- Spectral feature selection for supervised and unsupervised learning, Proceedings of the Twenty-Fourth International Conference on Machine Learning
- Coupled dictionary learning for unsupervised feature selection, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
- Laplacian score for feature selection, Proceedings of the Advances in Neural Information Processing Systems
- On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng.
- Unsupervised feature selection with adaptive structure learning, Proceedings of the Twenty-First ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Sparse representation based Fisher discrimination dictionary learning for image classification, Int. J. Comput. Vis.
Chunyan Xu received the B.Sc. degree from Shandong Normal University in 2007, the M.Sc. degree from Huazhong Normal University in 2010, and the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, in 2015. She was a visiting scholar at the National University of Singapore from 2013 to 2015. She is now working in the School of Computer Science and Engineering, Nanjing University of Science and Technology.
Jian Yang received the Ph.D. degree from Nanjing University of Science and Technology (NUST), on the subject of pattern recognition and intelligence systems in 2002. In 2003, he was a postdoctoral researcher at the University of Zaragoza. From 2004 to 2006, he was a Postdoctoral Fellow at Biometrics Centre of Hong Kong Polytechnic University. From 2006 to 2007, he was a Postdoctoral Fellow at Department of Computer Science of New Jersey Institute of Technology. Now, he is a Chang-Jiang professor in the School of Computer Science and Engineering of NUST. He is the author of more than 100 scientific papers in pattern recognition and computer vision. His journal papers have been cited more than 4000 times in the ISI Web of Science, and 9000 times in the Web of Scholar Google. His research interests include pattern recognition, computer vision and machine learning. Currently, he is/was an associate editor of Pattern Recognition Letters, IEEE Trans. Neural Networks and Learning Systems, and Neurocomputing. He is a Fellow of IAPR.
Junbin Gao received the B.Sc. degree in computational mathematics from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1982, and the Ph.D. degree from the Dalian University of Technology, Dalian, China, in 1991. He is a Professor with the Discipline of Business Analytics, University of Sydney Business School, The University of Sydney, Sydney, NSW, Australia. He was a Senior Lecturer and a Lecturer of computer science from 2001 to 2005 with the University of New England, Armidale, NSW, Australia. From 1982 to 2001, he was an Associate Lecturer, a Lecturer, an Associate Professor, and a Professor with the Department of Mathematics, HUST. From 2002 to 2015, he was a Professor of computing science with the School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia. His current research interests include machine learning, data mining, Bayesian learning and inference, and image analysis.
Fa Zhu is currently pursuing Ph.D. degree from the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, PR China. He is a visiting Ph.D. student in the Centre for Artificial Intelligence (CAI), and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. His current research interests include pattern recognition and machine learning.