
Neurocomputing

Volume 314, 7 November 2018, Pages 360-370

Low-rank structure preserving for unsupervised feature selection

https://doi.org/10.1016/j.neucom.2018.06.010

Abstract

Unsupervised feature selection has been widely applied to machine learning and pattern recognition, as it does not require class labels. Most popular unsupervised feature selection methods focus on various forms of reconstruction and minimize the reconstruction residual by discarding features with low contributions. However, they cannot effectively preserve the data distribution across multiple subspaces, because the sample structure information is not sufficiently utilized to constrain the selected features. In this paper, we propose a low-rank structure preserving method for unsupervised feature selection (LRPFS) to address this shortcoming. The data matrix consisting of the selected features is treated as a dictionary, which is learned under a low-rank constraint to preserve the subspace structure. Meanwhile, we further leverage a sparse penalty to remove redundant features and thus obtain discriminative features with intrinsic structure. In this way, the low-rank constraint can preserve the sample distribution more precisely by using discriminative features. In turn, the refined sample structure boosts the selection of more representative features. The effectiveness of our method is supported by both theoretical and experimental results.

Introduction

Data acquisition has become increasingly convenient with the rapid development of computer hardware and the Internet. In many practical areas, such as face recognition, video surveillance, signal processing and gene micro-arrays, data are represented by large matrices, so their subsequent processing consumes ever more time and storage space [1], [2]. The most popular way to overcome this problem is dimensionality reduction. Dimensionality reduction not only accelerates algorithm execution, but may also improve final classification or clustering accuracy, because research has shown that real-world data are generally highly correlated and their intrinsic dimension is often smaller than that of the original space [3], [4].

One of the primary means of dimensionality reduction is feature selection. Feature selection keeps the original meaning of the features and thus maintains their interpretability. According to the availability of label information, feature selection algorithms can be broadly classified as supervised, semi-supervised and unsupervised. We focus on unsupervised feature selection: although it is more challenging because massive data typically lack label information, it is less costly than supervised scenarios. To date, most unsupervised feature selection methods cover the following aspects: cluster structure learning [5], [6], [7], [8], [9], [10], [11], data reconstruction [12], [13], [14], local similarity preserving [11], [15], [16], and combinations of these approaches [7], [8], [9], [12], [17], [18], [19], [20].

The goal of feature selection for unsupervised learning is to identify a feature subset that best keeps the intrinsic cluster structure hidden in the data according to the specified clustering criteria [21]. It has been demonstrated that spectral clustering can capture the cluster structure of samples effectively [22]. For example, multi-cluster feature selection (MCFS) [5] uses spectral analysis to detect the intrinsic dimension or "flat" embedding of the data, and then measures the importance of each feature with an l1-norm regularized sparse regression model. Joint embedding learning and spectral regression (JELSR) [6] combines embedding learning and sparse regression to perform feature selection.
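To make the MCFS pipeline above concrete, the following Python fragment is an illustrative sketch rather than the original implementation: it builds a k-nearest-neighbor graph, computes the spectral embedding from the normalized graph Laplacian, and scores each feature by its largest LARS coefficient magnitude across the embedding dimensions. The function name and parameter values (numbers of neighbors, clusters and nonzero coefficients) are placeholder choices.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.linear_model import Lars

def mcfs_style_scores(X, n_clusters=5, n_neighbors=5, n_nonzero=20):
    """X: (n_samples, n_features); returns one importance score per feature."""
    # symmetric k-NN affinity graph (MCFS also allows heat-kernel weights)
    S = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    S = 0.5 * (S + S.T).toarray()
    # spectral ("flat") embedding: bottom eigenvectors of the normalized Laplacian
    L = laplacian(S, normed=True)
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, 1:n_clusters + 1]            # skip the trivial first eigenvector
    # one sparse (LARS) regression per embedding dimension; a feature's score is
    # its largest coefficient magnitude across dimensions
    scores = np.zeros(X.shape[1])
    for c in range(Y.shape[1]):
        coef = Lars(n_nonzero_coefs=n_nonzero).fit(X, Y[:, c]).coef_
        scores = np.maximum(scores, np.abs(coef))
    return scores                               # larger score = more important
```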

The cluster structure can also be revealed by predicting cluster indicators, which can be regarded as approximations of class labels. The cluster indicator matrix is usually generated by non-negative matrix factorization or the graph Laplacian. Typical approaches include non-negative discriminative feature selection (NDFS) [8], robust unsupervised feature selection (RUFS) [9], feature selection via clustering-guided sparse structural learning (CGSSL) [7], and embedding unsupervised feature selection (EUFS) [10]. Nevertheless, these methods construct the indicator matrix from pseudo labels and therefore require a priori knowledge of the number of categories.

Another popular way to perform feature selection is based on minimizing the reconstruction error. The importance of the features is reflected by a projection matrix in subspace learning or a coefficient matrix in self-representation. Typical subspace learning methods based on data reconstruction include matrix factorization feature selection (MFFS) [12] and its extensions, e.g., discriminative sparse subspace learning (DSSL) [19] and global and local structure preserving sparse subspace learning (GLoSS) [20], as well as coupled dictionary learning for unsupervised feature selection (CDL-FS) [13]. CDL-FS introduces a coupled analysis-synthesis dictionary learning framework based on data reconstruction, in which the synthesis dictionary is used to reconstruct samples and the analysis dictionary to encode them. Regularized self-representation (RSR) [14] learns the correlation between features via a sparsely regularized self-representation matrix, and then exploits the most representative features to reconstruct the other features.
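The self-representation idea behind RSR can be summarized by a short sketch. For brevity, the fragment below replaces the l2,1 penalties of RSR with a ridge (Frobenius) penalty so that the coefficient matrix has a closed-form solution; the ranking principle, namely scoring each feature by the row norm of its representation coefficients, is the same. The function name and regularization value are illustrative.

```python
import numpy as np

def self_representation_scores(X, lam=1.0):
    """X: (n_samples, n_features); returns one score per feature."""
    d = X.shape[1]
    G = X.T @ X                                   # feature Gram matrix, (d, d)
    # closed-form minimizer of ||X - X A||_F^2 + lam * ||A||_F^2
    A = np.linalg.solve(G + lam * np.eye(d), G)
    # a feature that helps reconstruct many other features gets a large row norm
    return np.linalg.norm(A, axis=1)
```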

In the absence of label information, similarity is a pivotal factor and should be considered in feature selection. A number of algorithms aim to assess the features' capability of preserving sample similarity. Such similarity can be inferred from predefined similarity measures, from local and global perspectives, separately or simultaneously. For instance, the Laplacian score [15] and similarity preserving feature selection (SPFS) [11] rank features based on an affinity matrix (and the corresponding degree and Laplacian matrices) using different evaluation criteria. Graph embedding is also used to preserve local similarity, and it can be combined with the other two approaches (i.e., cluster structure learning and data reconstruction) as a regularization term [6], [7], [9], [19], [20].
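As an illustration of similarity-preserving criteria, the following sketch computes a Laplacian-score-style ranking: features whose values vary little between neighboring samples, relative to their overall variance, receive a small score and are preferred. A binary k-NN affinity graph is used here for simplicity, whereas He et al. [15] use a heat-kernel weighted graph; the parameter values are placeholders.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, n_neighbors=5):
    """X: (n_samples, n_features); smaller score = better feature."""
    S = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    S = 0.5 * (S + S.T).toarray()               # symmetric affinity matrix
    d = S.sum(axis=1)                           # node degrees
    L = np.diag(d) - S                          # unnormalized graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()               # remove the degree-weighted mean
        scores[r] = (f @ L @ f) / max(f @ (d * f), 1e-12)
    return scores
```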

The aforementioned algorithms reveal that the accuracy of the cluster indicator matrix or the affinity matrix significantly influences the final result of feature selection, especially in embedded methods in which all variables are learned simultaneously. However, real data usually contain noise and outliers, so using the original data to estimate the indicator matrix or construct the affinity matrix is unreliable. Recent work (e.g., structured optimal graph feature selection (SOGFS) [23] and feature selection with adaptive structure learning (FSASL) [17]) provides feasible solutions to this problem. SOGFS adaptively learns the local manifold structure to make the affinity matrix more accurate. FSASL uses the selected features to preserve the global and sparse reconstruction structure via a row-sparse transformation matrix. Nevertheless, the sample distribution cannot be well maintained in this reconstruction process, as the reconstruction may depend excessively on a few samples when l1-minimization is imposed on the coefficient matrix [24]. Therefore, it becomes necessary to find a more suitable and accurate metric to capture the real structure of the samples.

Similar to cluster indicators, representation coefficients also reflect the data distribution [13]. To this end, we introduce low-rank representation coefficients to characterize the global structure of the samples represented by the selected features. From the perspective of samples, low-rank representation (LRR) can learn the correct cluster structure when samples are noise-free and drawn from multiple subspaces, by imposing a low-rank constraint on the representation coefficient matrix. Unfortunately, uninformative or misleading features in high-dimensional data have adverse effects on estimating the coefficient matrix, which can also be regarded as an affinity matrix. To overcome this problem, we construct the representation coefficient matrix in a low-dimensional embedding space. By using the l2,1-regularized projection matrix, redundant and irrelevant features are removed. Therefore, each sample has a more compact expression and can be clustered into its true class by a more accurate affinity matrix. Consequently, the intrinsic structure of the data can be preserved and used to identify representative features. From the perspective of features, the similarity between the selected features is weakened and noise is reduced by the l2,1-regularized projection matrix, so that the most representative features, which can be used to reconstruct the other features, are selected. Conversely, if the selected features are not sufficiently discriminative, the correlation coefficient matrix of the samples will not satisfy the low-rank constraint and the features cannot be completely reconstructed. Through the above analysis, we find that sample structure preserving, feature reconstruction and feature selection are complementary.

It is worth pointing out that our proposed method takes l2,1-norm regularized reconstruction error minimization into account, as GLoSS [20] does, but differs in that GLoSS focuses on the intrinsic geometric structure of the manifold, which exhibits local stability in the embedding. Its major precondition is closely related to the smoothness (manifold) assumption: the underlying manifold is sufficiently smooth that it can be well approximated by connecting the sample points with a neighborhood graph. Our proposed model, in contrast, rests on the practical hypothesis that the observed samples are drawn from a mixture of several low-dimensional subspaces, so that the data matrix stacked from all the vectorized observations should be approximately of low rank. Hence the selected features effectively preserve the similarity of samples from the same subspace, and perform well in classification or clustering tasks.

In light of these analyses, in this work we aim to learn a suitable dictionary while simultaneously carrying out subspace clustering and unsupervised feature selection. Based on the projection matrix in the dictionary, features are ranked by their ability to reconstruct the original features while the similarity of the samples is preserved. We propose a low-rank structure preserving algorithm for unsupervised feature selection (LRPFS). The procedure is illustrated in Fig. 1.

Our main contributions consist of the following three aspects:

  1. In order to weaken the "similarity" between the selected features and, at the same time, to enhance the "similarity" between highly correlated samples, we propose a novel unsupervised feature selection model that exploits group sparse regularized data reconstruction and low-rank regularized cluster structure preserving simultaneously. The low-rank constraint is used to learn a suitable dictionary that preserves the subspace structures of the data samples, while the sparse constraint removes redundant features, which in turn boosts the learning of the dictionary for data reconstruction. The learned dictionary plays an important role in bridging these two sub-tasks.

  2. According to the learned dictionary, our method represents the samples in a much more efficient form, in which the affinity matrix can be constructed from cleaner data instead of the original data. With this refined characterization of the structure, valuable features are then selected effectively.

  3. We design a practical and simple algorithm to solve the proposed optimization problem. Extensive experiments are conducted on six real-world datasets from various areas, and our method is compared with twelve popular unsupervised feature selection algorithms. The experimental results demonstrate that the proposed method achieves more promising performance on different datasets. In addition, we analyze the sensitivity of the parameters and observe that our method maintains stable performance over a wide range of parameter values.

The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. The details of LRPFS are introduced in Section 3. We apply the block coordinate descent (BCD) method and the fast iterative shrinkage-thresholding algorithm (FISTA) to solve the optimization problem in Section 4. Section 5 is devoted to experimental results and analyses. Finally, Section 6 concludes the paper.
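Since FISTA and the l2,1 norm recur throughout the paper, the following sketch shows only the two generic ingredients, not the concrete sub-problems of Section 4: the proximal operator of the l2,1 norm (row-wise soft thresholding) and a standard FISTA loop that uses it. The function names, the generic smooth term f and its Lipschitz constant are assumptions of this illustration.

```python
import numpy as np

def prox_l21(W, tau):
    """Row-wise soft thresholding:
    argmin_Z 0.5*||Z - W||_F^2 + tau*||Z||_{2,1}."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)           # l2 norm of each row
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * W                                           # shrink rows toward zero

def fista_l21(grad_f, lipschitz, W0, tau, n_iter=100):
    """Standard FISTA for min_W f(W) + tau*||W||_{2,1}, given the gradient of the
    smooth term f and a Lipschitz constant of that gradient."""
    W, V, t = W0.copy(), W0.copy(), 1.0
    for _ in range(n_iter):
        W_new = prox_l21(V - grad_f(V) / lipschitz, tau / lipschitz)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        V = W_new + ((t - 1.0) / t_new) * (W_new - W)          # momentum extrapolation
        W, t = W_new, t_new
    return W
```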

Section snippets

Related work

Before introducing our proposed method, we review two related topics: robust principal component analysis (RPCA) and low-rank representation (LRR). To facilitate the presentation, the symbols used in this paper are listed in Table 1.
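Solvers for RPCA and LRR typically rely on the proximal operator of the nuclear norm, known as singular value thresholding (SVT). A minimal sketch of this standard operator is given below for reference; it is not specific to the formulations reviewed here.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding:
    argmin_Z 0.5*||Z - M||_F^2 + tau*||Z||_* (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt          # soft-threshold singular values
```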

Data from real applications can frequently be characterized by a low-rank structure. If all the clean data samples are stacked as column vectors of a matrix, the matrix should be approximately low rank. Thus, exploring the low-rank subspace structures becomes a

Problem statement

In this section, we propose to learn a dictionary by minimizing the reconstruction error and preserving the cluster structure of the data simultaneously. Based on this dictionary, feature selection is carried out indirectly via the projection matrix. The proposed method can select informative features effectively even when the input features are corrupted. We state our model from two aspects in Sections 3.1 and 3.2, respectively.

Optimization and algorithms

In this section, we present an algorithm to solve our LRPFS model. Problem (7) is non-convex jointly with respect to all optimization variables, i.e., W, Z and H, but it is convex with respect to each of them when the others are fixed. According to this characteristic, we choose the block coordinate descent (BCD) method of Gauss–Seidel type [35] to solve this problem. It separates (7) with respect to W, H and Z into three independent sub-problems, which can be cyclically
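Schematically, a Gauss–Seidel BCD loop of the kind described above looks as follows. The sub-problem solvers are left as placeholders because the concrete updates (and the roles and shapes of W, H and Z) are specified in Section 4; this fragment only illustrates the cyclic structure and a simple stopping rule.

```python
import numpy as np

def bcd_three_blocks(X, W0, H0, Z0, update_W, update_H, update_Z,
                     n_iter=50, tol=1e-5):
    """Gauss-Seidel BCD: cyclically solve the convex sub-problem in each block
    while the other two blocks are held fixed."""
    W, H, Z = W0, H0, Z0
    for _ in range(n_iter):
        W_old = W
        W = update_W(X, H, Z)       # sub-problem in W (H, Z fixed)
        H = update_H(X, W, Z)       # sub-problem in H (W, Z fixed)
        Z = update_Z(X, W, H)       # sub-problem in Z (W, H fixed)
        # stop when the projection block stabilizes
        if np.linalg.norm(W - W_old) <= tol * max(np.linalg.norm(W_old), 1.0):
            break
    return W, H, Z
```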

Experiments

In this section, we illustrate the applicability of our proposed unsupervised feature selection method, LRPFS, by comparing it with twelve popular algorithms on six real-world datasets. As our method performs feature selection under the guidance of multiple-subspace clustering, we evaluate it in terms of clustering, following previous unsupervised feature selection works [7], [20]. Moreover, we give a sensitivity analysis of the parameters in the objective function.
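The evaluation protocol sketched below is the usual one for unsupervised feature selection and is given only as a generic reference, not as the exact settings of Section 5: keep the top-ranked features, run k-means several times, and report the average normalized mutual information (NMI) against the ground-truth labels (clustering accuracy via Hungarian matching is computed analogously). The function name, the number of selected features and the number of k-means restarts are placeholder choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate_selection(X, y_true, scores, n_selected=100, n_runs=10):
    """X: (n_samples, n_features); scores: per-feature importance
    (larger = more important). Returns the mean NMI over several k-means runs."""
    idx = np.argsort(scores)[::-1][:n_selected]     # indices of top-ranked features
    Xs = X[:, idx]
    k = len(np.unique(y_true))                      # number of clusters = number of classes
    nmis = []
    for seed in range(n_runs):                      # k-means depends on initialization
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xs)
        nmis.append(normalized_mutual_info_score(y_true, labels))
    return float(np.mean(nmis))
```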

Conclusions

In this work, motivated by the nature of the feature selection task, we set out to reduce redundant features and noise while maintaining the inherent distribution of the samples. To this end, we propose an unsupervised feature selection method (LRPFS) combining sparsity and low-rankness. Essentially, l2,1-norm minimization encourages row sparsity for feature selection but lacks a grouping effect for preserving the sample structure. The grouping effect generated by the nuclear norm ensures that the samples

Acknowledgment

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Natural Science Fund of China under Grant Nos. U1713208, 61472187, 61602244 and 61772276, the 973 Program No. 2014CB349303, Program for Changjiang Scholars, the Natural Science Foundation of Jiangsu Province under grant no. BK20170857, the fundamental research funds for the central universities no. 30918011321 and


References (42)

  • D. Cai et al., Unsupervised feature selection for multi-cluster data, Proceedings of the Sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
  • C. Hou et al., Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern., 2014.
  • Z. Li et al., Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng., 2014.
  • Z. Li et al., Unsupervised feature selection using nonnegative spectral analysis, Proceedings of the Twenty Sixth AAAI Conference on Artificial Intelligence, 2012.
  • M. Qian et al., Robust unsupervised feature selection, Proceedings of the Twenty Third International Joint Conference on Artificial Intelligence, 2013.
  • Z. Zhao et al., Spectral feature selection for supervised and unsupervised learning, Proceedings of the Twenty Fourth International Conference on Machine Learning, 2007.
  • P. Zhu et al., Coupled dictionary learning for unsupervised feature selection, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • X. He et al., Laplacian score for feature selection, Proceedings of the Advances in Neural Information Processing Systems, 2005.
  • Z. Zhao et al., On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng., 2013.
  • L. Du et al., Unsupervised feature selection with adaptive structure learning, Proceedings of the Twenty First ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
  • M. Yang et al., Sparse representation based Fisher discrimination dictionary learning for image classification, Int. J. Comput. Vis., 2014.

    Wei Zheng received the B.S. and the M.S. degrees from the School of Automation Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2004 and 2007, respectively. She is working toward the Ph.D. degree in the School of Computer Science and Engineering, Nanjing University of Science and Technology (NUST), Nanjing, China. Her research interests include pattern recognition, data mining, and machine learning.

    Chunyan Xu received the B.Sc. degree from Shandong Normal University in 2007 and the M.Sc. degree from Huazhong Normal University in 2010 and the Ph.D.degree in the School of Computer Science and Technology, Huazhong University of Science and Technology in 2015. She is a visiting scholar at National University of Singapore from 2013 to 2015. She is now working in the school of Computer Science and Engineering, Nanjing University of Science and Technology.

    Jian Yang received the Ph.D. degree from Nanjing University of Science and Technology (NUST), on the subject of pattern recognition and intelligence systems in 2002. In 2003, he was a postdoctoral researcher at the University of Zaragoza. From 2004 to 2006, he was a Postdoctoral Fellow at Biometrics Centre of Hong Kong Polytechnic University. From 2006 to 2007, he was a Postdoctoral Fellow at Department of Computer Science of New Jersey Institute of Technology. Now, he is a Chang-Jiang professor in the School of Computer Science and Engineering of NUST. He is the author of more than 100 scientific papers in pattern recognition and computer vision. His journal papers have been cited more than 4000 times in the ISI Web of Science, and 9000 times in the Web of Scholar Google. His research interests include pattern recognition, computer vision and machine learning. Currently, he is/was an associate editor of Pattern Recognition Letters, IEEE Trans. Neural Networks and Learning Systems, and Neurocomputing. He is a Fellow of IAPR.

    Junbin Gao received the B.Sc. degree in computational mathematics from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1982, and the Ph.D. degree from the Dalian University of Technology, Dalian, China, in 1991. He is a Professor with the Discipline of Business Analytics, University of Sydney Business School, The Universtiy of Sydney, Sydney, NSW, Australia. He was a Senior Lecturer and a Lecturer of computer science from 2001 to 2005 with University of New England, Armidale, NSW, Australia. From 1982 to 2001, he was an Associate Lecturer, a Lecturer, an Associate Professor, and a Professor with the Department of Mathematics, HUST. From 2002 to 2015, he was a Professor of computing science with the School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia. His current research interests include machine learning, data mining, Bayesian learning and inference, and image analysis.

    Fa Zhu is currently pursuing Ph.D. degree from the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, PR China. He is a visiting Ph.D. student in the Centre for Artificial Intelligence (CAI), and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. His current research interests include pattern recognition and machine learning.
