Neurocomputing

Volume 275, 31 January 2018, Pages 523-532
Robust data representation using locally linear embedding guided PCA

https://doi.org/10.1016/j.neucom.2017.08.053

Abstract

Locally Linear Embedding (LLE) is widely used for embedding data on a nonlinear manifold. It aims to preserve the local neighborhood structure of the data manifold. Our work begins with a new observation that LLE has a natural robustness property. Motivated by this observation, we propose to integrate LLE and PCA into an LLE guided PCA model (LLE-PCA) that incorporates the global structure and the local neighborhood structure simultaneously while remaining robust to outliers. LLE-PCA has a compact closed-form solution and can be computed efficiently. Extensive experiments on five datasets show promising results on data reconstruction and improvements on data clustering and semi-supervised learning tasks.

Introduction

Efficient and compact representation of data is a fundamental problem in data mining and machine learning. Real-world data usually have high dimensionality and often contain noise and errors. One widely used and effective approach is to find a low-dimensional representation for high-dimensional data based on low-rank matrix factorization [1], [2], [3], [4], [5], [6], [7]. In the low-dimensional space, the cluster distribution becomes more apparent, which can improve machine learning results [8], [9], [10], [11]. In addition to matrix factorization, trace-norm based low-rank and sparse methods have also been widely used for data representation [12], [13], [14], [15]. For example, one popular method is sparse representation [12], which has been successfully applied to the face recognition problem. Liu et al. [13] proposed a robust subspace segmentation and data recovery method based on low-rank representation (LRR). Liu and Yan [16] extended LRR to Latent Low-Rank Representation (LatLRR), which uses both observed and unobserved (hidden) data and thus provides a more robust representation. Li et al. [14] proposed a robust structured subspace learning (RSSL) algorithm for data representation that integrates image recognition and feature extraction into a general joint learning model. Zhang et al. [17] provided a matrix completion method via truncated nuclear norm regularization, which can better approximate the matrix rank function. In this paper, we focus on matrix factorization based data representation methods.

Principal Component Analysis (PCA) is a classical data representation and dimension reduction technique [1], [18], [19], [20]. It has been widely used to learn a low-dimensional representation of data on a linear manifold. When PCA is applied to data representation and dimension reduction, it aims to preserve the global Euclidean structure of the data [21], [22], [23]. However, in many applications data lie on a nonlinear manifold. A popular remedy is to use manifold embedding methods such as Laplacian Embedding [24], Isomap [25], Locality Preserving Projections [26], Locally Linear Embedding [27], Local Tangent Space Alignment [28], Neighborhood Preserving Embedding [29], kernel methods [30], etc. As one of the most popular manifold embedding methods, Locally Linear Embedding (LLE) [27], [30] has been widely used for embedding data on a nonlinear manifold. It aims to preserve the local neighborhood structure of the data manifold and thus behaves robustly with respect to local outliers.

In this paper, we begin with a new observation that LLE has a natural robustness property. Roughly speaking, the second step of LLE (which computes the embedding coordinates) can be approximately determined by the normal data points alone, without consideration of outliers or severely corrupted data points. Once the embedding coordinates of the normal data points are computed, the outlier data points are brought back to the correct data manifold through embedding in their kNN neighborhoods, which consist of normal data points only.
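To make this observation concrete, the display below restates the standard second step of LLE in our own notation; it paraphrases the well-known LLE objective rather than quoting a formula from this paper.

```latex
% Second step of LLE (standard formulation): with the reconstruction
% weights W fixed, the embedding coordinates Y = (y_1, ..., y_n)^T in
% R^{n x d} solve
\min_{Y^\top Y = I} \; \sum_{i=1}^{n} \Big\| y_i - \sum_{j} W_{ij}\, y_j \Big\|^2
  \;=\; \operatorname{Tr}\!\left( Y^\top M Y \right),
\qquad M = (I - W)^\top (I - W).
```

If $x_i$ is an outlier whose kNN neighborhood contains only normal points, the minimizer places $y_i \approx \sum_j W_{ij} y_j$; that is, the outlier's embedding is interpolated from the embeddings of its normal neighbors, which is exactly the robustness property described above.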

Motivated by this observation, we propose to integrate LLE and PCA into a new robust data representation model, called LLE guided PCA (LLE-PCA), which combines LLE's robustness to the local manifold structure with PCA's preservation of the global structure of the data. The LLE-PCA method has three main benefits.

  • It can be used for data reconstruction. Compared with traditional PCA, LLE-PCA reconstruction is more robust to outliers due to the LLE regularization.

  • It can be used for dimension reduction. In the LLE-PCA low-dimensional subspace, the cluster structure of the data generally becomes more apparent, which makes it more suitable for data clustering and semi-supervised learning tasks.

  • It has a compact closed-form solution and can be efficiently computed. This makes it simple and suitable for practical applications.

Moreover, based on the LLE-PCA model, we show experimentally how the information in the data X can be preserved (retained) in the LLE manifold embedding. This can be seen as a new feature of the LLE manifold learning and embedding method. We perform experiments on several datasets. The promising results show that LLE-PCA is robust and consistently improves clustering and semi-supervised learning.

Section snippets

Locally linear embedding and robustness

Principal component analysis (PCA) provides an embedding for data lying on a linear manifold [1], [21]. However, in many applications data lie on a nonlinear manifold. One popular method is locally linear embedding (LLE) [27], [30], [31], a nonlinear dimension reduction approach.

Let $X=(x_1,x_2,\dots,x_n)$ be a set of $n$ input points in a high-dimensional data space $\mathbb{R}^p$. LLE expects that each data point and its neighbors lie close to a locally linear manifold. It reconstructs each …
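The snippet is truncated here. For completeness, the following is a minimal NumPy sketch of LLE's first step, which reconstructs each point from its k nearest neighbors; the function name and the default values of k and reg are illustrative choices, not the paper's implementation.

```python
import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    """Step 1 of LLE: reconstruct each point from its k nearest
    neighbors; returns the n x n reconstruction weight matrix W."""
    n = X.shape[0]                            # X is n x p, points in rows
    W = np.zeros((n, n))
    # pairwise squared Euclidean distances (fine for a small sketch)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        idx = np.argsort(D[i])[1:k + 1]       # k nearest, excluding self
        Z = X[idx] - X[i]                     # center neighbors on x_i
        G = Z @ Z.T                           # local Gram matrix, k x k
        G += reg * np.trace(G) * np.eye(k)    # regularize for stability
        w = np.linalg.solve(G, np.ones(k))    # solve G w = 1
        W[i, idx] = w / w.sum()               # weights sum to one
    return W
```

The matrix W produced here is exactly what enters the second-step objective above through $M = (I - W)^\top (I - W)$.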

LLE guided PCA

Motivated by the above observation on LLE robustness, in this section we utilize LLE for robust data representation. We first give a brief review of Principal Component Analysis (PCA) and then present our LLE guided PCA (LLE-PCA) model.

PCA finds the best low-dimensional subspace defined (spanned) by the principal directions $U=(u_1,\dots,u_d)\in\mathbb{R}^{p\times d}$. Here the subspace dimension $d$ is usually much less than the input data dimension $p$. The projected data points in the new subspace are $V=(v_1,$ …
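The snippet cuts off before the model itself. As a hedged sketch only: one formulation consistent with the description (PCA reconstruction plus an LLE regularizer, with a closed-form solution) is $\min_{U,V} \|X - UV^\top\|_F^2 + \alpha \operatorname{Tr}(V^\top M V)$ subject to $U^\top U = I$, where $M=(I-W)^\top(I-W)$ is the LLE matrix. For fixed $U$ the optimum is $V = (I+\alpha M)^{-1} X^\top U$, and $U$ is then given by the top-$d$ eigenvectors of $X (I + \alpha M)^{-1} X^\top$. The objective, the trade-off parameter $\alpha$, and all names in the code below are our assumptions, not the paper's published equations.

```python
import numpy as np

def lle_guided_pca(X, W, d=10, alpha=1.0):
    """Hedged sketch of an LLE-regularized PCA with a closed-form
    eigen-solution. X is p x n (columns are data points); W is the
    n x n LLE weight matrix; alpha trades off PCA reconstruction
    against the LLE regularizer (assumed form, not the paper's)."""
    n = X.shape[1]
    I = np.eye(n)
    M = (I - W).T @ (I - W)                    # LLE embedding matrix
    S = np.linalg.solve(I + alpha * M, X.T)    # (I + alpha*M)^{-1} X^T
    C = X @ S                                  # p x p surrogate covariance
    evals, evecs = np.linalg.eigh((C + C.T) / 2)
    U = evecs[:, np.argsort(evals)[::-1][:d]]  # top-d principal directions
    V = S @ U                                  # n x d embedded coordinates
    X_rec = U @ V.T                            # reconstruction of X
    return U, V, X_rec
```

As a sanity check on the sketch, setting alpha = 0 recovers standard PCA: V = X^T U with U the top-d eigenvectors of X X^T.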

Experiments

To test the effectiveness of the proposed model, in this section we implement LLE-PCA and evaluate it on several datasets, including AT&T Faces, Binalpha, MNIST, USPS, and COIL20. The details of these datasets are listed below:

The AT&T Faces dataset contains ten different images of each of 40 distinct subjects. The size of each image is 92 × 112 pixels, with 256 grey levels …

Conclusion

Motivated by the LLE robustness property identified in this paper, we propose to combine PCA and LLE into LLE-PCA, which provides a new robust low-dimensional data representation. LLE-PCA has a simple closed-form solution and can be efficiently computed. One important benefit of LLE-PCA is that it performs robustly with respect to significant data corruption. The properties, together with the clustering and semi-supervised learning comparisons in our experiments, indicate that LLE-PCA is a good low-dimensional data …

Acknowledgment

This work is supported by the National Natural Science Foundation of China (61602001, 61572030, 61671018); the Natural Science Foundation of Anhui Province (1708085QF139); the Natural Science Foundation of Anhui Higher Education Institutions of China (KJ2016A020); the Co-Innovation Center for Information Supply & Assurance Technology, Anhui University; and the Open Projects Program of the National Laboratory of Pattern Recognition.


References (32)

  • S. Xiang et al.

    Nonlinear dimensionality reduction with local spline embedding

    IEEE Trans. Knowl. Data Eng.

    (2008)
  • Y. Song

    Orthogonal locality minimizing globality maximizing projections for feature extraction

    Opt. Eng.

    (2009)
  • J. Wright et al.

    Robust face recognition via sparse representation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • G. Liu et al.

    Robust subspace segmentation by low-rank representation

    Proceedings of the International Conference on Machine Learning (ICML)

    (2010)
  • C. Hou et al.

    Joint embedding learning and sparse regression: a framework for unsupervised feature selection

    IEEE Trans. Cybern.

    (2013)
  • G. Liu et al.

    Latent Low-Rank Representation for subspace segmentation and feature extraction

    Proceedings of the International Conference on Computer Vision

    (2012)

Bo Jiang received the B.Eng. degree in 2012 and the Ph.D. degree in 2015 in computer science from Anhui University, Hefei, China. He is currently a lecturer in computer science at Anhui University. His current research interests include image and graph matching, image feature extraction, and statistical pattern recognition.

Chris Ding received his Ph.D. degree from Columbia University in 1987. He then joined the California Institute of Technology and later the Jet Propulsion Laboratory, California Institute of Technology. From 1996 to 2007, he was with Lawrence Berkeley National Laboratory, University of California. Since 2007, he has been with the University of Texas at Arlington. His main research areas are machine learning, data mining, bioinformatics, information retrieval, web link analysis, and high performance computing. He has published about 200 papers that have been cited more than 13,000 times (Google Scholar).

Bin Luo received his Ph.D. degree in Computer Science in 2002 from the University of York, United Kingdom. He has published more than 200 papers in journals, edited books, and refereed conferences. He is a professor at Anhui University, China. At present, he chairs the IEEE Hefei Subsection. He has served as a peer reviewer for international academic journals such as IEEE Trans. on PAMI, Pattern Recognition, Pattern Recognition Letters, International Journal of Pattern Recognition and Artificial Intelligence, and Neurocomputing. His current research interests include random graph based pattern recognition, image and graph matching, and video analysis.
