Bayesian robust principal component analysis with structured sparse component

https://doi.org/10.1016/j.csda.2016.12.005

Abstract

Robust principal component analysis (RPCA) refers to the decomposition of an observed matrix into a low-rank component and a sparse component. Conventional methods model the sparse component as pixel-wise sparse (e.g., via the $\ell_1$-norm). However, in many practical scenarios, the elements of the sparse part are not truly independently sparse but are distributed with contiguous structures. This is one reason why representative RPCA techniques fail to work well in realistic, complex situations. To solve this problem, a Bayesian framework for RPCA with a structured sparse component is proposed, in which both the amplitudes and the support correlation structure are considered simultaneously in recovering the sparse component. Model learning is based on variational Bayesian inference, which can be applied to estimate the posteriors of all latent variables. Experimental results on synthetic and real data validate the proposed methodology.

Introduction

High-dimensional data arising in image and video processing, document analysis, web search, and biology often lie on a low-dimensional structure or manifold. As the fields of optimization and statistics have developed, the problem of exploring and exploiting low-dimensional structure in high-dimensional data has been studied extensively.

Principal component analysis (PCA) (Serneels and Verdonck, 2008; Giordani and Kiers, 2006), a classical and popular unsupervised dimensionality reduction approach, has wide applications in computer vision and pattern recognition. To a large extent, PCA efficiently computes the best low-rank approximation of the observed data in the $\ell_2$-norm sense. However, the $\ell_2$-norm is sensitive to outliers, which often appear in practice. Therefore, PCA may perform poorly under gross corruption, even when the corruption affects only a small fraction of the data.

To address the brittleness of classical PCA with respect to outliers, robust principal component analysis (RPCA) (Candès et al., 2011) has been proposed to decompose the observed data matrix $\mathbf{Y}\in\mathbb{R}^{m\times n}$ into a low-rank matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ (rank $r\ll\min\{m,n\}$) and a sparse matrix $\mathbf{E}\in\mathbb{R}^{m\times n}$ (containing sparse outliers). This line of research seeks to recover the low-rank matrix $\mathbf{X}$ and the sparse matrix $\mathbf{E}$ from $\mathbf{Y}$ by solving the optimization problem
$$\min_{\mathbf{X},\mathbf{E}}\ \operatorname{rank}(\mathbf{X})+\lambda\|\mathbf{E}\|_0,\quad \text{s.t.}\ \mathbf{Y}=\mathbf{X}+\mathbf{E},\tag{1}$$
where $\|\mathbf{E}\|_0$ denotes the number of nonzero entries in $\mathbf{E}$ and $\lambda$ is a positive weighting parameter. Unfortunately, (1) is a highly nonconvex optimization problem. However, one can obtain a tractable problem by relaxing (1), replacing the $\ell_0$-norm with the $\ell_1$-norm and the rank with the nuclear norm, which yields the convex optimization problem
$$\min_{\mathbf{X},\mathbf{E}}\ \|\mathbf{X}\|_*+\lambda\|\mathbf{E}\|_1,\quad \text{s.t.}\ \mathbf{Y}=\mathbf{X}+\mathbf{E},\tag{2}$$
where $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the matrix nuclear norm (the sum of the singular values) and the $\ell_1$-norm (the sum of the absolute values of the entries), respectively. Candès et al. (2011) showed that under broad conditions the underlying low-rank component $\mathbf{X}$ and the sparse component $\mathbf{E}$ can be exactly recovered with high probability by solving this convex optimization problem. As a promising tool, RPCA has been successfully applied in many research fields, such as video surveillance (Zhou et al., 2013), face recognition (Wagner et al., 2012), subspace clustering (Liu et al., 2013) and image alignment (Peng et al., 2012).
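For concreteness, the convex program (2) is commonly solved with an inexact augmented Lagrange multiplier (ALM) scheme. The following is a minimal NumPy sketch of such a solver, not the algorithm proposed in this paper; the defaults ($\lambda=1/\sqrt{\max(m,n)}$ as suggested by Candès et al. (2011), and the heuristic choice of the penalty parameter $\mu$) follow common practice.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise soft-thresholding: the proximal operator of the l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Singular value soft-thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_pcp(Y, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Sketch of inexact ALM for: min ||X||_* + lam * ||E||_1  s.t.  Y = X + E."""
    m, n = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(Y).sum() + 1e-12)  # heuristic
    L = np.zeros_like(Y)   # Lagrange multiplier
    E = np.zeros_like(Y)
    for _ in range(max_iter):
        X = svd_threshold(Y - E + L / mu, 1.0 / mu)   # low-rank update
        E = soft_threshold(Y - X + L / mu, lam / mu)  # sparse update
        R = Y - X - E                                 # primal residual
        L += mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(Y):
            break
    return X, E
```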

In fact, perfectly disentangling the low-rank and sparse components raises an identifiability issue. To make the problem meaningful, we first need to require that the low-rank component is not sparse. As in Candès et al. (2011), the singular value decomposition of the true low-rank component $\mathbf{X}_0$ is written as $\mathbf{X}_0=\mathbf{U}\Sigma\mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are the matrices of left and right singular vectors. Supposing that the rank of $\mathbf{X}_0$ is $r$, the incoherence condition with parameter $\mu$ is defined as
$$\max_i\|\mathbf{U}^T\mathbf{e}_i\|^2\le\frac{\mu r}{m},\qquad \max_i\|\mathbf{V}^T\mathbf{e}_i\|^2\le\frac{\mu r}{n},\tag{3}$$
$$\|\mathbf{U}\mathbf{V}^T\|_\infty\le\sqrt{\frac{\mu r}{mn}},\tag{4}$$
where $\|\mathbf{M}\|_\infty=\max_{i,j}|M_{ij}|$ and $\mathbf{e}_i$ is the unit vector with a one in the $i$th element and zeros everywhere else. As discussed in Candès et al. (2011), the incoherence condition asserts that for small values of $\mu$ the singular vectors of $\mathbf{X}_0$ are spread out; in other words, the low-rank component $\mathbf{X}_0$ is not sparse. To avoid another degenerate situation, we assume that the sparsity pattern of the true sparse component is selected uniformly at random (so the sparse component is not low-rank). Under these assumptions, namely that the rank of the low-rank component is not too large and the sparse component is reasonably sparse, the model (2) can perfectly recover the low-rank and sparse components. The result is given in the following theorem.

Theorem 1 (Candès et al., 2011)

Suppose $\mathbf{X}_0$ obeys (3)–(4). Fix any $m\times n$ matrix $\Psi$ of signs. Suppose that the support set $\Omega$ of $\mathbf{E}_0$ is uniformly distributed among all sets of cardinality $w$, and that $\operatorname{sgn}([\mathbf{E}_0]_{ij})=\Psi_{ij}$ for all $(i,j)\in\Omega$. Then there is a numerical constant $c$ such that, with probability at least $1-cn_1^{-10}$ (over the choice of the support of $\mathbf{E}_0$, with $n_1=\max(m,n)$), the model (2) with a range of correct values of $\lambda$ exactly recovers the true low-rank and sparse components, provided that $\operatorname{rank}(\mathbf{X}_0)\le\rho_r n_2\mu^{-1}(\log n_1)^{-2}$ and $w\le\rho_s mn$. Here $\rho_r$ and $\rho_s$ are positive numerical constants and $n_2=\min(m,n)$.
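To make the incoherence condition (3)–(4) concrete, the following sketch (our own illustration, assuming NumPy; the function name `coherence` is ours) estimates the smallest $\mu$ for which a given matrix satisfies all three bounds:

```python
import numpy as np

def coherence(X, r=None):
    """Estimate the incoherence parameter mu of X with respect to (3)-(4)."""
    m, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    if r is None:
        r = int((s > 1e-10 * s[0]).sum())                # numerical rank
    U, V = U[:, :r], Vt[:r, :].T
    mu_u = (m / r) * np.max(np.sum(U**2, axis=1))        # max_i ||U^T e_i||^2 <= mu*r/m
    mu_v = (n / r) * np.max(np.sum(V**2, axis=1))        # max_i ||V^T e_i||^2 <= mu*r/n
    mu_uv = (m * n / r) * np.max(np.abs(U @ V.T))**2     # ||U V^T||_inf <= sqrt(mu*r/(mn))
    return max(mu_u, mu_v, mu_uv)

# A random low-rank matrix is incoherent (small mu) with high probability.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 300))
print(coherence(X0))
```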

Algorithms developed for the RPCA problem are often intuitive extensions of low-rank matrix recovery and therefore share a similar trajectory. Among the proposed methods are heuristic deterministic approaches based on nuclear norm relaxation, such as singular value thresholding (SVT) (Cai et al., 2010), singular value projection (SVP) (Jain et al., 2010), an accelerated proximal gradient algorithm (APG) (Toh and Yun, 2010), and the augmented Lagrange multiplier method (ALM) (Lin et al., 2010). Although these algorithms are theoretically well founded, their reconstruction performance is limited because the nuclear norm may not be a good surrogate for the rank function. To obtain a more accurate and robust approximation to the rank function, Hu et al. (2013) proposed truncated nuclear norm regularization (TNNR), which minimizes only the sum of the smallest $\min(m,n)-p$ singular values to recover the low-rank component. Noting that existing nonconvex penalty functions are concave with monotonically decreasing gradients, an iteratively reweighted nuclear norm (IRNN) algorithm was suggested in Lu et al. (2014). Inspired by the $\ell_p$-norm paradigm in compressive sensing (Ince et al., 2013), several works extend this concept to the traditional nuclear norm (Lu, 2014; Lu et al., 2015; Nie et al., 2012; Peng et al., 2014), which approximates the rank function more closely.
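As a point of reference, the TNNR penalty just mentioned can be written in a few lines (a sketch of the penalty value only, assuming NumPy and following Hu et al. (2013); the actual TNNR solver is iterative and more involved):

```python
import numpy as np

def truncated_nuclear_norm(X, p):
    """TNNR penalty: the sum of all but the p largest singular values of X,
    so the dominant rank-p subspace is left unpenalized."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[p:].sum()
```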

In addition to the above deterministic methods, some statistical algorithms express RPCA as a Bayesian inference problem and apply statistical tools to solve it. Such statistical procedures (Manteiga and Vieu, 2007), which have recently gained popularity, offer several advantages over deterministic methods. First, prior knowledge of the matrix rank is not necessary; the unknown rank is estimated in a way similar to the automatic relevance determination strategy in machine learning. Second, algorithmic parameters are treated as stochastic quantities, making the methods insensitive to parameter initialization. In Ding et al. (2011), the authors modeled the singular values of the low-rank component $\mathbf{X}$ and the entries of the sparse component $\mathbf{E}$ with beta-Bernoulli priors, and the resulting algorithm used a Markov chain Monte Carlo (MCMC) sampling scheme with high computational complexity for inference. Babacan et al. (2012) adopted sparse Bayesian learning principles to recover the sparse and low-rank components, starting from a matrix factorization formulation and enforcing the low-rank constraint through a sparsity constraint. Complex noise, as considered in Zhao et al. (2014), is represented as a mixture of Gaussians, which can fit a wide range of noises and provides high recovery performance.

However, existing methods impose the sparsity constraint on each individual coefficient of $\mathbf{E}$ and ignore the spatial connections among nonzero coefficients. In many practical scenarios, the coefficients of the sparse component are not truly pixel-wise sparse but structurally sparse. A stream of work on problems related to RPCA has proposed the use of structural information. In face recognition, sparse noises such as occlusions and shadows form contiguous regions at different face positions. The sparse outliers in background subtraction are also typically spatially contiguous. It is possible to leverage realistic sparse distributions that go beyond simple sparsity by exploiting structural dependencies between the values and locations of the sparse coefficients. Gao et al. (2014) introduced a novel algorithm for background subtraction, which falls into the category of group-sparse RPCA. However, these works do not extend such special circumstances to general RPCA. Clearly a further adjustment, either implicit or explicit, is needed.

The structured sparse model, which parallels the conventional theory, is a general extension of traditional sparse patterns. It effectively captures the structured properties of the sparse component by clustering relevant parts together, so that the dependencies among different elements are taken into account. Structured sparsity has proved useful in a variety of domains including compressive sensing (Baraniuk et al., 2010; Peleg et al., 2012; Yu et al., 2012; Drémeau et al., 2012), background subtraction (Gao et al., 2014), and face recognition (Jia et al., 2012; Xiao et al., 2014). To illustrate the rationale of the group-sparse model, Fig. 1 shows two different sparse distributions, in which blank regions indicate zero pixels and shaded regions correspond to outliers (nonzero values). Clearly, an element-wise sparsity measure (e.g., the $\ell_1$-norm) assigns the same value to the two different patterns. It therefore seems plausible to further distinguish different sparse patterns and capture the interactions between related pixels.
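A tiny numerical illustration of this point (our own, hypothetical example): two supports of equal cardinality, one clustered and one scattered, are indistinguishable to the $\ell_1$-norm.

```python
import numpy as np

# Two 6x6 sparse patterns: 'clustered' places its nonzeros in one contiguous
# block, 'scattered' spreads the same number of ones over isolated pixels.
clustered = np.zeros((6, 6))
clustered[1:3, 1:4] = 1.0
scattered = np.zeros((6, 6))
scattered[[0, 2, 4, 1, 3, 5], [0, 2, 4, 5, 1, 3]] = 1.0

# The l1-norm (and the l0 count) cannot tell the two patterns apart.
assert np.abs(clustered).sum() == np.abs(scattered).sum() == 6.0
```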

Our motivation in this paper is to solve the RPCA problem with a structured sparse component. The beta process for latent factor analysis has been widely studied and provides great flexibility in learning sparsity. By introducing binary latent factor indicators, the zero elements of the sparse component can be guaranteed to be exactly zero. Taking advantage of beta process factor analysis (Paisley and Carin, 2009), the estimation of the sparse component is split into the magnitudes of the sparse coefficients and the latent variables indicating the sparse patterns. According to the inherent structures, the dependence between entries in the support can be modeled through a graph with local clusters, and a Gaussian-Gamma framework is employed on each local cluster to control the sparsity. In this way, not only the values of the sparse component but also their sparse patterns are exploited to boost the performance of RPCA. Moreover, the proposed method is cast within a Bayesian inference procedure based on variational Bayesian approximation (McGrory and Titterington, 2007).

This paper is organized as follows. In Section 2, we present the proposed Bayesian framework. In Section 3, variational Bayesian inference is employed to estimate the model. Experiments on synthetic and real data are presented in Section 4. After that, we devise an extended model that can be applied to background subtraction in Section 5. Finally, Section 6 concludes the paper.

Next, we introduce some notation used in this paper. Boldfaced upper-case letters, e.g., $\mathbf{A}$, denote matrices, while boldfaced lower-case letters, e.g., $\mathbf{a}$, denote column vectors. For a matrix $\mathbf{A}$, $\mathbf{A}_{i\cdot}$ denotes the $i$th row, $\mathbf{A}_{\cdot i}$ denotes the $i$th column, and $A_{i,j}$ denotes the element in the $i$th row and $j$th column. An $m\times m$ identity matrix is denoted by $\mathbf{I}_m$; $\mathbf{1}_{mn}$ denotes the $m\times n$ matrix in which all elements are equal to 1, and $\mathbf{1}_m$ is an $m$-dimensional row vector of ones. $\mathbf{A}^T$ and $\operatorname{Tr}(\mathbf{A})$ denote the transpose and trace of $\mathbf{A}$, respectively. $\|\mathbf{A}\|_F$ is the Frobenius norm, i.e., the square root of the sum of the squared absolute values of all elements of $\mathbf{A}$.

Bayesian RPCA model based on structured sparsity

Let $\mathbf{Y}\in\mathbb{R}^{m\times n}$ be the observed matrix, which consists of three parts: a low-rank component $\mathbf{X}\in\mathbb{R}^{m\times n}$, a sparse component $\mathbf{E}\in\mathbb{R}^{m\times n}$ and a noise term $\mathbf{N}\in\mathbb{R}^{m\times n}$. The low-rank component $\mathbf{X}$ is formulated as a matrix factorization, which is commonly used in recovering low-rank matrices, i.e., $\mathbf{X}=\mathbf{U}\mathbf{V}^T$. As discussed above, we separate the estimation of the sparse values from the estimation of the latent labels indicating the sparse patterns, so that the sparse nature of $\mathbf{E}$ can be guaranteed, i.e., $\mathbf{E}=\mathbf{W}\odot\mathbf{Z}$. The …
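The decomposition just described can be mimicked with a toy generative sketch (our own; it uses a crudely smoothed Bernoulli support as a stand-in for the paper's clustered beta-Bernoulli prior, and interprets $\mathbf{E}=\mathbf{W}\odot\mathbf{Z}$ as an elementwise product of magnitudes and binary indicators):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 40, 50, 3

# Low-rank part X = U V^T, as in the factorization above.
U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))
X = U @ V.T

# Sparse part E = W * Z: Gaussian magnitudes W, binary support Z.
# Smoothing the Bernoulli probabilities crudely encourages contiguous
# support, standing in for the structured support prior.
pi = rng.beta(1.0, 9.0, size=(m, n))
pi = 0.5 * (pi + np.roll(pi, 1, axis=0))
Z = (rng.random((m, n)) < pi).astype(float)
W = rng.normal(scale=3.0, size=(m, n))
E = W * Z

Y = X + E + rng.normal(scale=0.1, size=(m, n))  # observation Y = X + E + N
```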

Variational Bayesian inference

Let $\Theta=\{\mathbf{U},\mathbf{V},\gamma,\mathbf{W},\alpha,\mathbf{Z},\pi,\beta\}$ denote all hidden variables. Bayesian inference is based on evaluating the posterior distributions of the unknowns given the observation. However, the posterior distributions are computationally intractable since the marginal distribution $p(\mathbf{Y})$ cannot be calculated analytically. In this paper, we use variational Bayes (VB) (Bishop, 2006) to deal with the intractable joint posterior distribution. The approximate posterior distribution is denoted by $q(\Theta)$. The …
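For reference, the standard mean-field recipe that such VB schemes instantiate (cf. Bishop, 2006) factorizes $q(\Theta)=\prod_i q(\theta_i)$ and cyclically updates each factor as
$$\ln q^\star(\theta_i)=\mathbb{E}_{q(\Theta\setminus\theta_i)}\big[\ln p(\mathbf{Y},\Theta)\big]+\mathrm{const},$$
where the expectation is taken over all factors except $q(\theta_i)$; each such update monotonically increases the evidence lower bound on $\ln p(\mathbf{Y})$.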

Experiments

In this section, we present experimental results on synthetic and real data to illustrate the performance of the proposed algorithm. Before conducting these experiments, we study the choice of the cardinality $k$ (the group size). On the synthetic data, both unstructured and structured sparse cases are considered. Moreover, comparisons under different noise conditions are made to examine the robustness of the proposed algorithm. Finally, we carry out experiments on real data including face recognition and …

Discussion

So far, all issues in the process of handling RPCA with a structured sparse component have been addressed. The proposed variational Bayesian inference procedure estimates the posterior distributions of the unknowns iteratively: in each iteration, the method first computes the low-rank parts $q(\mathbf{U})$, $q(\mathbf{V})$ and $q(\gamma)$ using (24), (25) and (27), respectively. The structured sparse parts are then calculated according to (28), (29), (30) and (31), respectively. Finally, the estimation of the noise is obtained from …

Conclusion

In this paper, we proposed a variational Bayesian algorithm for estimating the low-rank and structured sparse components from observed data. The key ingredient of our method is a more realistic structured sparsity model that goes beyond the simple sparsity pattern by codifying the interdependency structure among different elements of the sparse component. To this end, the estimation of the sparse component is split into the magnitudes of the sparse elements and latent variables indicating …

Acknowledgments

The authors would like to express their sincere gratitude to Dr. Ying Li for many valuable suggestions and comments that helped to improve the paper. The work was partially supported by the National Natural Science Foundation of China (Grant No. 61379014).

References

  • J.-F. Cai et al.

    A singular value thresholding algorithm for matrix completion

    SIAM J. Optim.

    (2010)
  • E.J. Candès et al.

Robust principal component analysis?

    J. ACM

    (2011)
  • X. Ding et al.

    Bayesian robust principal component analysis

    IEEE Trans. Image Process.

    (2011)
  • A. Drémeau et al.

    Boltzmann machine and mean-field approximation for structured sparse decompositions

    IEEE Trans. Signal Process.

    (2012)
  • Z. Gao et al.

Block-sparse RPCA for salient motion detection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2014)
  • Y. Hu et al.

    Fast and accurate matrix completion via truncated nuclear norm regularization

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • Jain, P., Meka, R., Dhillon, I.S., 2010. Guaranteed rank minimization via singular value projection. In: Advances in Neural Information Processing Systems.