
Neurocomputing

Volume 272, 10 January 2018, Pages 279-293

Hierarchical sparse coding from a Bayesian perspective

https://doi.org/10.1016/j.neucom.2017.06.076

Abstract

We consider the problem of hierarchical sparse coding, where not only are a few groups of atoms active at a time but each group also enjoys internal sparsity. Current approaches usually achieve between-group sparsity with the $\ell_1$ penalty, so that many groups retain small coefficients rather than being accurately zeroed out. These trivial groups make the model prone to overfitting noise and are thereby harmful to the interpretability of the sparse representation. To this end, we reformulate the hierarchical sparse model from a Bayesian perspective employing twofold priors: the spike-and-slab prior and the Laplace prior. The former is utilized to explicitly induce between-group sparsity, while the latter is adopted both for inducing within-group sparsity and for obtaining a small reconstruction error. We propose a nested prior by integrating both priors to produce hierarchical sparsity. The resultant optimization problem can be solved to convergence in a few iterations via the proposed nested algorithm, which corresponds to the nested prior. In experiments, we evaluate the performance of our method on signal recovery, image inpainting and sparse-representation-based classification, with simulated signals and two publicly available image databases. The results show that the proposed method, compared with popular methods for sparse coding, can yield a more concise representation and a more reliable interpretation of data.

Introduction

Parsimony, preferring a simple explanation over a complex one, is probably one of the most important principles guiding data modeling. In the context of statistics and machine learning, it takes the form of compact representations and economical models of data [1]. A compact representation of data can reveal an inherent interpretation of the data-generating process and is thus conducive to subsequent data analysis tasks, as in principal component analysis (PCA) [2] and the fast correlation-based filter (FCBF) [3]. An economical model, selected during modeling, can effectively avoid overfitting and thus yields better generalization performance, as in joint embedding learning and sparse regression [4]. Achieving parsimonious representations of data and models has become a primary task in most scientific domains, especially when the studies involve high-dimensional observations.

In response to the call for parsimony, sparse modeling asserts that a high-dimensional datum has many coefficients equal to zero when linearly represented by a combination of atoms from a given or learned dictionary [5], [6]. In the case of a fixed dictionary, this is usually referred to as sparse coding, which we consider in this paper. As the kernel of sparse modeling, sparse coding not only delivers a concise representation for data explanation [7] and analysis [8], [9], [10], but also yields a succinct linear model for data reconstruction [11] and prediction [12]. Sparse coding has therefore attracted an enormous amount of study in recent years and has achieved many appealing results in computer vision [13], pattern analysis [8], compressed sensing [14] and image retrieval [15].

With the sparsity assertion, sparse coding aims at selecting as few atoms as possible from a given overcomplete dictionary to linearly reconstruct the probe data, while keeping the reconstruction error as small as possible. Concretely, let $y \in \mathbb{R}^d$ be a probe datum and $D \in \mathbb{R}^{d \times m}$ be the dictionary composed of $m$ atoms $d_i$. Sparse coding seeks the sparse representation $x \in \mathbb{R}^m$ of $y$ in $D$ using the following primitive formulation:
$$\min_x \|x\|_0 \quad \text{subject to} \quad y = Dx, \tag{1}$$
where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero entries in a vector. It is well known that solving (1) is NP-hard; however, a sub-optimal solution can be obtained with a greedy scheme, such as orthogonal matching pursuit (OMP) [16]. An alternative route is to relax it into its closest convex proxy via the $\ell_1$-norm [17], [18], whose Lagrangian formulation can be cast as:
$$\min_x \frac{1}{2}\|y - Dx\|_2^2 + \lambda \|x\|_1, \tag{2}$$
where the parameter $\lambda$ balances the reconstruction error and the sparsity. This convex formulation is often referred to as the Lasso [12] or basis pursuit [19]. Countless algorithms have been put forth to pursue the optimal solution to (2), such as least angle regression (LARS) [20]; see [13], [21] and references therein.
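To make formulation (2) concrete, the following is a minimal sketch that solves it with proximal gradient descent (ISTA). Practical solvers such as LARS [20] or coordinate descent are more efficient; the function and parameter names below are illustrative, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(y, D, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - D x||_2^2 + lam * ||x||_1, as in (2)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of 0.5 * ||y - Dx||^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```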

In many practical situations, one often knows a group structure on the coefficient vector $x$ in addition to sparsity. In this case, the nonzero entries of $x$ are no longer unrelated but appear in groups, i.e., groups of atoms should be selected or ignored simultaneously. For example, in genomics, factors of gene expression patterns are expected to involve groups of genes corresponding to a biological pathway, or sets of genes that are neighbors in the protein-protein interaction network. Taking this higher-order prior knowledge into account benefits sparse coding in both interpretability [13] and predictive performance [22]. The Elastic Net penalizes both the $\ell_1$-norm and the $\ell_2$-norm to promote such a grouping effect [23], while Group Lasso imposes the mixed $\ell_1/\ell_2$-norm on groups to directly encourage between-group sparsity [24]. Denoting by $x = [x_1, x_2, \ldots, x_J] \in \mathbb{R}^m$ the coefficient vector partitioned into $J$ groups and by $D = [D_1, D_2, \ldots, D_J] \in \mathbb{R}^{d \times m}$ the correspondingly partitioned dictionary matrix, Group Lasso can be cast as:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + \lambda \sum_{j=1}^{J} p_j \|x_j\|_2, \tag{3}$$
where $p_j$ is the size of the group $x_j$ and the second term is the mixed $\ell_1/\ell_2$-norm. Group Lasso can be tackled by convex algorithms [21] and has been extended to the case of overlapping groups [25]. Its greedy version [17] can be written as:
$$\min_x \sum_{j=1}^{J} I\big(\|x_j\|_2 > 0\big) \quad \text{subject to} \quad y = Dx, \tag{4}$$
where $I(\cdot)$ denotes an indicator function. An approximate solution to (4) can be delivered by the block orthogonal matching pursuit (BOMP) algorithm [17]. Besides, similar methods, structured orthogonal matching pursuit (StructOMP) [18] and group orthogonal matching pursuit (GroupOMP) [26], have been investigated from different theoretical perspectives and in different applications.
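A small sketch of proximal gradient descent for the Group Lasso objective (3) is given below; the only change from the ISTA sketch above is that soft-thresholding is applied blockwise to each group $x_j$ rather than elementwise. Here `groups`, a list of index arrays partitioning the atoms, is an assumed representation of the group structure.

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2 for one group of coefficients."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def group_lasso_pgd(y, D, groups, lam, n_iter=500):
    """Approximately minimize 0.5*||y - Dx||_2^2 + lam * sum_j p_j * ||x_j||_2, as in (3)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - D.T @ (D @ x - y) / L      # gradient step on the quadratic term
        for g in groups:                   # blockwise prox, with weight p_j = group size
            x[g] = block_soft_threshold(z[g], lam * len(g) / L)
    return x
```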

However, group sparse coding results in dense coefficients within the selected groups, where a sparsifying effect is often desired for further interpretation. For example, in genomics we would like sparsity within each group in addition to sparsity between groups, so that we can identify particularly important genes in biological pathways of interest. Adding such hierarchical sparse prior knowledge can give rise to both a more robust representation and a more convincing data interpretation than group sparse coding [7]. Towards this end, Sparse-Group Lasso uses a convex combination of the Lasso penalty and the Group-Lasso penalty to obtain hierarchical sparsity at both the group level and the atom level [27]:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + (1-\alpha)\lambda \sum_{j=1}^{J} p_j \|x_j\|_2 + \alpha\lambda \|x\|_1, \tag{5}$$
where $\alpha \in [0, 1]$ constructs a convex combination of penalties: $\alpha = 0$ gives the Group-Lasso fit, and $\alpha = 1$ gives the Lasso fit. A similar method, named Hierarchical Lasso (HiLasso) [28], also uses the $\ell_1/\ell_2$-norm and the $\ell_1$-norm for hierarchical sparsity, but without the convex combination. Introducing two independent balance parameters $\lambda_1$ and $\lambda_2$, HiLasso aims to:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + \lambda_1 \sum_{j=1}^{J} \|x_j\|_2 + \lambda_2 \|x\|_1. \tag{6}$$
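For concreteness, the helpers below (an illustrative sketch, not library code) evaluate the Sparse-Group Lasso objective (5) and the HiLasso objective (6) for a given coefficient vector, again with `groups` as an assumed list of index arrays and the group weight $p_j$ taken as the group size, as in (3) and (5).

```python
import numpy as np

def sgl_objective(y, D, x, groups, lam, alpha):
    """Sparse-Group Lasso objective (5), with group weight p_j = group size."""
    fit = 0.5 * np.sum((y - D @ x) ** 2)
    group_pen = sum(len(g) * np.linalg.norm(x[g]) for g in groups)
    return fit + (1 - alpha) * lam * group_pen + alpha * lam * np.sum(np.abs(x))

def hilasso_objective(y, D, x, groups, lam1, lam2):
    """HiLasso objective (6) with independent balance parameters lam1 and lam2."""
    fit = 0.5 * np.sum((y - D @ x) ** 2)
    group_pen = sum(np.linalg.norm(x[g]) for g in groups)
    return fit + lam1 * group_pen + lam2 * np.sum(np.abs(x))
```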

When the groups are of equal size, Sparse-Group Lasso becomes a special case of HiLasso with the constrained parameter selection shown in (5): for a common group size $p$, (5) corresponds to (6) with $\lambda_1 = (1-\alpha)\lambda p$ and $\lambda_2 = \alpha\lambda$. Since both (5) and (6) are convex, various convex algorithms can be exploited [21]. Note that the formulation for hierarchical sparsity is more general, since it can degenerate into the Lasso (2) and the Group Lasso (3). Hence, the hierarchical sparse model is capable of dealing with various application scenarios and has attracted a lot of attention. The combination of the squared mixed $\ell_1/\ell_2$-norm and the $\ell_0$-norm is developed in [29]. In [30], $\ell_0 + \ell_1 + \ell_2$ regularization is used for hierarchical sparsity. Besides, a more complex regularization is presented in [31].

Hierarchical sparsity has thus been attained; nevertheless, the current methods mostly achieve between-group sparsity by imposing the $\ell_1$-norm on the groups. To obtain an intuitive motivation, we sparsely reconstruct the corrupted image from the USPS dataset shown in Fig. 1(a). We randomly select 100 images per digit as the dictionary, over which the representation is supposed to be hierarchically sparse. That is, the corrupted image is expected to select images of similar handwriting belonging to the same digit for reconstruction [8]. Sparse-Group Lasso, HiLasso and our method give rise to the sparse coefficients shown in Fig. 1. From Fig. 1(b, c), we find that Sparse-Group Lasso and HiLasso produce groups with small coefficients rather than groups that are completely zeroed out. There are at least two reasons for this phenomenon. One is data noise: noise is likely to break the original trade-off between the reconstruction error and the sparsity, and thus shifts the sparsity level in seeking a new compromise. The other is the coarse penalization by the $\ell_1$-norm: assigning an identical penalty to all groups under-penalizes the groups that should be ignored and over-penalizes the groups with truly large coefficients. As a result, the resulting representation is unclean and its interpretability suffers. Several studies have addressed these difficulties, such as the reports [32], [33] for the first issue and the adaptive Lasso [34], [35] for the second. In addition, the method of [29] mainly replaces the $\ell_1$-norm of HiLasso with the $\ell_0$-norm and fails to consider the two drawbacks; the route of [30] does not explicitly take the group structure into account; and the approach of [31], like Sparse-Group Lasso, only considers the second shortcoming. Thus, effective methods for tackling both drawbacks, especially for hierarchical sparse coding, are still lacking.

To this end, in this paper we consider the problem of hierarchical sparse coding within the Bayesian framework, which usually yields a more interpretable formulation [36]. In this framework, we construct a nested prior for hierarchical sparsity by employing the spike-and-slab prior and the Laplace prior together. Introducing the former at the group level, which results in the $\ell_0$-norm, aims to achieve between-group sparsity, while imposing the latter on each atom within a group, which results in the $\ell_1$-norm, is expected to yield within-group sparsity as well as a small reconstruction error. Our approach can produce a cleaner sparse representation, as in the example shown in Fig. 1(d). The main contributions of our paper can be outlined as follows:

  • (1) We provide a Bayesian interpretation for hierarchical sparse coding using the proposed nested prior. To the best of our knowledge, this is the first time that such a prior, integrating the spike-and-slab prior and the Laplace prior, has been developed for sparse coding.

  • (2) The resultant mathematical formulation is general, because it can degenerate into BOMP and the Lasso in specific situations. Different from Sparse-Group Lasso and HiLasso, it explicitly stipulates between-group sparsity through the $\ell_0$-norm rather than the $\ell_1$-norm.

  • (3) Inspired by the proposed nested prior, we devise a simple algorithm that combines the OMP algorithm and the LARS algorithm to reach a convergent solution to the resultant optimization problem in a few iterations (an illustrative sketch of this nested scheme is given after this list). In addition, we offer some analyses of the algorithm.

  • (4) In the proposed algorithm, we take the two drawbacks above into account. Concretely, we perform greedy iterations of group selection until a group suspected of overfitting noise is encountered. On the other hand, sparse coding is restricted to the selected groups, so that the maximal penalization is imposed on the ignored groups.
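The sketch below is our illustrative reading of the nested scheme outlined in contributions (3) and (4), not the authors' exact algorithm: groups are selected greedily in an OMP-like fashion, and an $\ell_1$-regularized (LARS/Lasso-type) fit is restricted to the atoms of the selected groups, so ignored groups stay exactly at zero. The function names, the `groups` representation and the stopping thresholds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nested_sparse_coding(y, D, groups, max_groups=5, lam=0.1, tol=1e-6):
    """y: (d,) signal, D: (d, m) dictionary, groups: list of index arrays."""
    x = np.zeros(D.shape[1])
    selected = []                        # indices of selected groups
    residual = y.copy()
    for _ in range(max_groups):
        # OMP-like step: pick the group most correlated with the residual.
        scores = [np.linalg.norm(D[:, g].T @ residual) for g in groups]
        j = int(np.argmax(scores))
        if j in selected or scores[j] < tol:
            break                        # proxy for "no informative group left"
        selected.append(j)
        # Within-group step: l1-regularized fit on the selected atoms only,
        # so all ignored groups are maximally penalized (kept at zero).
        idx = np.concatenate([groups[k] for k in selected])
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        lasso.fit(D[:, idx], y)
        x[:] = 0.0
        x[idx] = lasso.coef_
        residual = y - D @ x
    return x, selected
```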

The remainder of the paper is organized as follows. In Section 2, we briefly review sparsity-inducing priors, namely the spike-and-slab prior and the Laplace prior. In Section 3, we propose the nested prior, present our method and the solving algorithm, and analyze the algorithm. The experimental results are shown in Section 4, followed by some discussion. We finally conclude the paper in Section 5.

Section snippets

Sparse coding with the Bayesian framework

To set the stage for introducing our Bayesian formulation, in this section, we review the common basic assumption of sparse linear model and the two extensively used sparsity-inducing priors: the spike-and-slab prior and the Laplace prior.

Usually, we consider the sparse linear representation model [37]:
$$y = Dx + \varepsilon,$$
where it is common to assume that the noise follows an i.i.d. Gaussian distribution with zero mean and variance $\sigma^2$, so that $\varepsilon \sim \mathrm{Normal}(0, \sigma^2 I)$. Note that in the context of d > 
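As a toy illustration of this generative model, the snippet below draws a hierarchically sparse coefficient vector (few active groups, sparse within each group) and corrupts the resulting signal with i.i.d. Gaussian noise. All dimensions and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, J = 64, 256, 16                  # signal dim, number of atoms, number of groups
group_size = m // J
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)         # normalize atoms to unit l2-norm

x = np.zeros(m)
active_groups = rng.choice(J, size=2, replace=False)    # between-group sparsity
for j in active_groups:
    idx = np.arange(j * group_size, (j + 1) * group_size)
    support = rng.choice(idx, size=4, replace=False)     # within-group sparsity
    x[support] = rng.standard_normal(4)

sigma = 0.05
y = D @ x + sigma * rng.standard_normal(d)               # y = Dx + eps
```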

The proposed method for hierarchical sparse coding

The current approaches for hierarchical sparse coding generally achieve between-group sparsity by imposing the $\ell_1$-norm at the group level, so that many groups have small coefficients rather than being discarded. To this end, we draw on the respective merits of the two priors recalled above to reformulate hierarchical sparse coding from a novel Bayesian perspective.

Experiments

In this section, we conduct experiments to evaluate the performance of the proposed method. First, we recover 1D simulated signals from their noisy observations to investigate the capability of discovering the correct sparse pattern. Then, we apply OEL to image inpainting to observe its denoising capability and the compactness of its representation. Finally, we probe the discriminative power of the sparse representation at a high confidence level. The following related methods are used for comparison: OMP [16]

Conclusion and future work

In this paper, we reformulate hierarchical sparse coding within the Bayesian framework, where we develop a nested prior by integrating two common sparsity-inducing priors: the Laplace prior and the spike-and-slab prior. The resulting objective stipulates between-group sparsity more explicitly than popular approaches. We then present a simple algorithm for pursuing its convergent solution. The experimental results on signal recovery and image recovery show that the

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their constructive suggestions and helpful comments.

References (51)

  • Y. Zhang et al., Low-rank preserving embedding, Pattern Recognit. (2017)
  • T. Hastie et al., Statistical Learning with Sparsity: The Lasso and Generalizations (2015)
  • C. Hou et al., Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern. (2014)
  • A.M. Bruckstein et al., From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. (2009)
  • M. Aharon et al., K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. (2006)
  • A. Zaslaver et al., Hierarchical sparse coding in the sensory system of Caenorhabditis elegans, Proc. Natl. Acad. Sci. (2015)
  • J. Wright et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • E. Elhamifar et al., Sparse subspace clustering: algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • J. Mairal et al., Sparse representation for color image restoration, IEEE Trans. Image Process. (2008)
  • R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996)
  • J. Mairal, F. Bach, J. Ponce, Sparse modeling for image and vision processing, arXiv preprint arXiv:1411.3230
  • B. Adcock et al., Breaking the coherence barrier: a new theory for compressed sensing, Forum Math. Sigma (2017)
  • J. Tropp et al., Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory (2007)
  • Y.C. Eldar et al., Block-sparse signals: uncertainty relations and efficient recovery, IEEE Trans. Signal Process. (2010)
  • J. Huang et al., Learning with structured sparsity, J. Mach. Learn. Res. (2011)

Yupei Zhang received the B.Eng. degree in computer science and technology from East China University of Technology in 2009 and the M.Eng. degree in computer software and theory from Zhengzhou University in 2013. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University. His current research interests mainly include sparse representation, pattern recognition and machine learning.

Ming Xiang received the B.Eng. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, China, in 1987 and 1999 respectively, and currently works as an associate professor in the department of computer science and technology in Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include information fusion, pattern recognition and machine learning.

Bo Yang received the B.Eng. degree in computer science and technology from Xi'an University of Posts & Telecommunication, Xi'an, China, in 2005, and received the M.Eng. degree in computer system architecture from Xidian University, Xi'an, China, in 2009. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include manifold learning, pattern recognition and machine learning.
