
Neurocomputing

Volume 272, 10 January 2018, Pages 279-293

Hierarchical sparse coding from a Bayesian perspective

https://doi.org/10.1016/j.neucom.2017.06.076

Abstract

We consider the problem of hierarchical sparse coding, where not only are a few groups of atoms active at a time but each group also enjoys internal sparsity. Current approaches usually achieve between-group sparsity with the $\ell_1$ penalty, so that many groups retain small coefficients rather than being accurately zeroed out. These trivial groups make the model prone to overfitting noise and are thereby harmful to the interpretability of the sparse representation. To this end, we reformulate the hierarchical sparse model from a Bayesian perspective employing twofold priors: the spike-and-slab prior and the Laplace prior. The former is utilized to explicitly induce between-group sparsity, while the latter is adopted both for inducing within-group sparsity and for obtaining a small reconstruction error. We propose a nested prior by integrating both priors to produce hierarchical sparsity. The resultant optimization problem can be solved to convergence in a few iterations via the proposed nested algorithm, which corresponds to the nested prior. In experiments, we evaluate the performance of our method on signal recovery, image inpainting and sparse-representation-based classification, with simulated signals and two publicly available image databases. The results show that the proposed method, compared with popular methods for sparse coding, can yield a more concise representation and a more reliable interpretation of data.

Introduction

Parsimony, preferring a simple explanation over a complex one, is probably one of the most important principles guiding data modeling. In the context of statistics and machine learning, it takes the form of compact representations and economical models of data [1]. A compact representation of data can reveal an inherent interpretation of the data-generating process and is thus conducive to subsequent data analysis tasks, as in principal component analysis (PCA) [2] and the fast correlation-based filter (FCBF) [3]. An economical model, selected during modeling, can effectively avoid overfitting and thus yields better generalization performance, as in joint embedding learning and sparse regression [4]. Achieving parsimonious representations of data and models has become a primary task in most scientific domains, especially when the studies involve high-dimensional observations.

In response to the call for parsimony, sparse modeling asserts that a high-dimensional datum has many coefficients equal to zero when linearly represented by a combination of atoms from a given or learned dictionary [5], [6]. In the case of a fixed dictionary, this is usually referred to as sparse coding, which we consider in this paper. As the kernel of sparse modeling, sparse coding not only delivers a concise representation for data explanation [7] and analysis [8], [9], [10], but also yields a succinct linear model for data reconstruction [11] and prediction [12]. Sparse coding has therefore attracted an enormous amount of study in recent years and has achieved many appealing results in computer vision [13], pattern analysis [8], compressed sensing [14] and image retrieval [15].

With the sparsity assertion, sparse coding aims at selecting as few atoms as possible from a given overcomplete dictionary to linearly reconstruct the probe data, while keeping the reconstruction error as small as possible. Concretely, let $y \in \mathbb{R}^d$ be a probe datum and $D \in \mathbb{R}^{d \times m}$ be the dictionary composed of $m$ atoms $d_i$. Sparse coding seeks the sparse representation $x \in \mathbb{R}^m$ of $y$ in $D$ using the following primitive formulation:
$$\min_x \|x\|_0 \quad \text{subject to} \quad y = Dx, \tag{1}$$
where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero entries in a vector. It is well known that solving (1) is NP-hard; however, a sub-optimal solution can be obtained with a greedy scheme, such as orthogonal matching pursuit (OMP) [16]. An alternative route is to relax it into its closest convex proxy via the $\ell_1$-norm [17], [18], whose Lagrangian formulation can be cast as:
$$\min_x \frac{1}{2}\|y - Dx\|_2^2 + \lambda \|x\|_1, \tag{2}$$
where the parameter $\lambda$ balances the reconstruction error and the sparsity. This convex formulation is often referred to as the Lasso [12] or basis pursuit [19]. Countless algorithms have been put forth to pursue the optimal solution to (2), such as least angle regression (LARS) [20]; see [13], [21] and references therein.
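To make formulation (2) concrete, the following is a minimal sketch that solves it with proximal gradient descent (ISTA). Practical solvers such as LARS [20] or coordinate descent are more efficient; the function and parameter names below are illustrative, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(y, D, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - D x||_2^2 + lam * ||x||_1, as in (2)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of 0.5 * ||y - Dx||^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```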

In many practical situations, one often knows a group structure on the coefficient vector $x$ in addition to sparsity. In this case, the nonzero entries of $x$ are no longer unrelated but appear in groups, i.e., groups of atoms should be selected or ignored simultaneously. For example, in genomics, factors of gene expression patterns are expected to involve groups of genes corresponding to a biological pathway, or sets of genes that are neighbors in the protein-protein interaction network. Taking this higher-order prior knowledge into account benefits sparse coding in both interpretability [13] and predictive performance [22]. The Elastic Net penalizes both the $\ell_1$-norm and the $\ell_2$-norm to promote such a grouping effect [23], while Group Lasso imposes the mixed $\ell_1/\ell_2$-norm on groups to directly encourage between-group sparsity [24]. Denoting by $x = [x_1, x_2, \ldots, x_J] \in \mathbb{R}^m$ the coefficient vector partitioned into $J$ groups and by $D = [D_1, D_2, \ldots, D_J] \in \mathbb{R}^{d \times m}$ the correspondingly partitioned dictionary matrix, Group Lasso can be cast as:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + \lambda \sum_{j=1}^{J} p_j \|x_j\|_2, \tag{3}$$
where $p_j$ is the size of the group $x_j$ and the second term is the mixed $\ell_1/\ell_2$-norm. Group Lasso can be tackled by convex algorithms [21] and has been extended to the case of overlapping groups [25]. Its greedy version [17] can be written as:
$$\min_x \sum_{j=1}^{J} I\big(\|x_j\|_2 > 0\big) \quad \text{subject to} \quad y = Dx, \tag{4}$$
where $I(\cdot)$ denotes an indicator function. An approximate solution to (4) can be delivered by the block orthogonal matching pursuit (BOMP) algorithm [17]. Besides, similar methods, structured orthogonal matching pursuit (StructOMP) [18] and group orthogonal matching pursuit (GroupOMP) [26], have been investigated from different theoretical perspectives and in different applications.
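A small sketch of proximal gradient descent for the Group Lasso objective (3) is given below; the only change from the ISTA sketch above is that soft-thresholding is applied blockwise to each group $x_j$ rather than elementwise. Here `groups`, a list of index arrays partitioning the atoms, is an assumed representation of the group structure.

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2 for one group of coefficients."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def group_lasso_pgd(y, D, groups, lam, n_iter=500):
    """Approximately minimize 0.5*||y - Dx||_2^2 + lam * sum_j p_j * ||x_j||_2, as in (3)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - D.T @ (D @ x - y) / L      # gradient step on the quadratic term
        for g in groups:                   # blockwise prox, with weight p_j = group size
            x[g] = block_soft_threshold(z[g], lam * len(g) / L)
    return x
```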

However, group sparse coding results in dense coefficients within the selected groups, where a sparsifying effect is often desired for further interpretation. For example, in genomics we would like sparsity within each group in addition to sparsity between groups, so that we can identify particularly important genes in biological pathways of interest. Adding such hierarchical sparse prior knowledge can give rise to both a more robust representation and a more convincing data interpretation than group sparse coding [7]. Towards this end, Sparse-Group Lasso uses a convex combination of the Lasso penalty and the Group-Lasso penalty to obtain hierarchical sparsity at both the group level and the atom level [27]:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + (1-\alpha)\lambda \sum_{j=1}^{J} p_j \|x_j\|_2 + \alpha\lambda \|x\|_1, \tag{5}$$
where $\alpha \in [0, 1]$ constructs a convex combination of penalties: $\alpha = 0$ gives the Group-Lasso fit, and $\alpha = 1$ gives the Lasso fit. A similar method, named Hierarchical Lasso (HiLasso) [28], also uses the $\ell_1/\ell_2$-norm and the $\ell_1$-norm for hierarchical sparsity, but without the convex combination. Introducing two independent balance parameters $\lambda_1$ and $\lambda_2$, HiLasso aims to:
$$\min_x \frac{1}{2}\Big\|y - \sum_{j=1}^{J} D_j x_j\Big\|_2^2 + \lambda_1 \sum_{j=1}^{J} \|x_j\|_2 + \lambda_2 \|x\|_1. \tag{6}$$
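For concreteness, the helpers below (an illustrative sketch, not library code) evaluate the Sparse-Group Lasso objective (5) and the HiLasso objective (6) for a given coefficient vector, again with `groups` as an assumed list of index arrays and the group weight $p_j$ taken as the group size, as in (3) and (5).

```python
import numpy as np

def sgl_objective(y, D, x, groups, lam, alpha):
    """Sparse-Group Lasso objective (5), with group weight p_j = group size."""
    fit = 0.5 * np.sum((y - D @ x) ** 2)
    group_pen = sum(len(g) * np.linalg.norm(x[g]) for g in groups)
    return fit + (1 - alpha) * lam * group_pen + alpha * lam * np.sum(np.abs(x))

def hilasso_objective(y, D, x, groups, lam1, lam2):
    """HiLasso objective (6) with independent balance parameters lam1 and lam2."""
    fit = 0.5 * np.sum((y - D @ x) ** 2)
    group_pen = sum(np.linalg.norm(x[g]) for g in groups)
    return fit + lam1 * group_pen + lam2 * np.sum(np.abs(x))
```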

When the groups are of equal size, Sparse-Group Lasso becomes a special case of HiLasso with the constrained parameter selection shown in (5): for a common group size $p$, (5) corresponds to (6) with $\lambda_1 = (1-\alpha)\lambda p$ and $\lambda_2 = \alpha\lambda$. Since both (5) and (6) are convex, various convex algorithms can be exploited [21]. Note that the formulation for hierarchical sparsity is more general, since it can degenerate into the Lasso (2) and the Group Lasso (3). Hence, the hierarchical sparse model is capable of dealing with various application scenarios and has attracted a lot of attention. The combination of the squared mixed $\ell_1/\ell_2$-norm and the $\ell_0$-norm is developed in [29]. In [30], $\ell_0 + \ell_1 + \ell_2$ regularization is used for hierarchical sparsity. Besides, a more complex regularization is presented in [31].

Hierarchical sparsity has thus been attained; nevertheless, the current methods mostly achieve between-group sparsity by imposing the $\ell_1$-norm on the groups. To obtain an intuitive motivation, we sparsely reconstruct the corrupted image from the USPS dataset shown in Fig. 1(a). We randomly select 100 images per digit as the dictionary, over which the representation is supposed to be hierarchically sparse. That is, the corrupted image is expected to select images of similar handwriting belonging to the same digit for reconstruction [8]. Sparse-Group Lasso, HiLasso and our method give rise to the sparse coefficients shown in Fig. 1. From Fig. 1(b, c), we find that Sparse-Group Lasso and HiLasso produce groups with small coefficients rather than groups that are completely zeroed out. There are at least two reasons for this phenomenon. One is data noise: noise is likely to break the original trade-off between the reconstruction error and the sparsity, and thus shifts the sparsity level in seeking a new compromise. The other is the coarse penalization by the $\ell_1$-norm: assigning an identical penalty to all groups under-penalizes the groups that should be ignored and over-penalizes the groups with truly large coefficients. As a result, the resulting representation is unclean and its interpretability suffers. Several studies have addressed these difficulties, such as the reports [32], [33] for the first issue and the adaptive Lasso [34], [35] for the second. In addition, the method of [29] mainly replaces the $\ell_1$-norm of HiLasso with the $\ell_0$-norm and fails to consider the two drawbacks; the route of [30] does not explicitly take the group structure into account; and the approach of [31], like Sparse-Group Lasso, only considers the second shortcoming. Thus, effective methods for tackling both drawbacks, especially for hierarchical sparse coding, are still lacking.

To this end, in this paper we consider the problem of hierarchical sparse coding within the Bayesian framework, which usually yields a more interpretable formulation [36]. In this framework, we construct a nested prior for hierarchical sparsity by employing the spike-and-slab prior and the Laplace prior together. Introducing the former at the group level, which results in the $\ell_0$-norm, aims to achieve between-group sparsity, while imposing the latter on each atom within a group, which results in the $\ell_1$-norm, is expected to yield within-group sparsity as well as a small reconstruction error. Our approach can produce a cleaner sparse representation, as in the example shown in Fig. 1(d). The main contributions of our paper can be outlined as follows:

  • (1) We provide a Bayesian interpretation for hierarchical sparse coding using the proposed nested prior. To the best of our knowledge, this is the first time that such a prior, integrating the spike-and-slab prior and the Laplace prior, has been developed for sparse coding.

  • (2) The resultant mathematical formulation is general, because it can degenerate into BOMP and the Lasso in specific situations. Different from Sparse-Group Lasso and HiLasso, it explicitly stipulates between-group sparsity through the $\ell_0$-norm rather than the $\ell_1$-norm.

  • (3) Inspired by the proposed nested prior, we devise a simple algorithm that combines the OMP algorithm and the LARS algorithm to reach a convergent solution to the resultant optimization problem in a few iterations (an illustrative sketch of this nested scheme is given after this list). In addition, we offer some analyses of the algorithm.

  • (4) In the proposed algorithm, we take the two drawbacks above into account. Concretely, we perform greedy iterations of group selection until a group suspected of overfitting noise is encountered. On the other hand, sparse coding is restricted to the selected groups, so that the maximal penalization is imposed on the ignored groups.
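The sketch below is our illustrative reading of the nested scheme outlined in contributions (3) and (4), not the authors' exact algorithm: groups are selected greedily in an OMP-like fashion, and an $\ell_1$-regularized (LARS/Lasso-type) fit is restricted to the atoms of the selected groups, so ignored groups stay exactly at zero. The function names, the `groups` representation and the stopping thresholds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nested_sparse_coding(y, D, groups, max_groups=5, lam=0.1, tol=1e-6):
    """y: (d,) signal, D: (d, m) dictionary, groups: list of index arrays."""
    x = np.zeros(D.shape[1])
    selected = []                        # indices of selected groups
    residual = y.copy()
    for _ in range(max_groups):
        # OMP-like step: pick the group most correlated with the residual.
        scores = [np.linalg.norm(D[:, g].T @ residual) for g in groups]
        j = int(np.argmax(scores))
        if j in selected or scores[j] < tol:
            break                        # proxy for "no informative group left"
        selected.append(j)
        # Within-group step: l1-regularized fit on the selected atoms only,
        # so all ignored groups are maximally penalized (kept at zero).
        idx = np.concatenate([groups[k] for k in selected])
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        lasso.fit(D[:, idx], y)
        x[:] = 0.0
        x[idx] = lasso.coef_
        residual = y - D @ x
    return x, selected
```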

The remainder of the paper is organized as follows. In Section 2, we briefly review sparsity-inducing priors, namely the spike-and-slab prior and the Laplace prior. In Section 3, we propose the nested prior, present our method and the solving algorithm, and analyze the algorithm. The experimental results are shown in Section 4, followed by some discussion. We finally conclude the paper in Section 5.

Section snippets

Sparse coding with the Bayesian framework

To set the stage for introducing our Bayesian formulation, in this section, we review the common basic assumption of sparse linear model and the two extensively used sparsity-inducing priors: the spike-and-slab prior and the Laplace prior.

Usually, we consider the sparse linear representation model [37]:
$$y = Dx + \varepsilon,$$
where it is common to assume that the noise follows an i.i.d. Gaussian distribution with zero mean and variance $\sigma^2$, so that $\varepsilon \sim \mathrm{Normal}(0, \sigma^2 I)$. Note that in the context of d > 
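As a toy illustration of this generative model, the snippet below draws a hierarchically sparse coefficient vector (few active groups, sparse within each group) and corrupts the resulting signal with i.i.d. Gaussian noise. All dimensions and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, J = 64, 256, 16                  # signal dim, number of atoms, number of groups
group_size = m // J
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)         # normalize atoms to unit l2-norm

x = np.zeros(m)
active_groups = rng.choice(J, size=2, replace=False)    # between-group sparsity
for j in active_groups:
    idx = np.arange(j * group_size, (j + 1) * group_size)
    support = rng.choice(idx, size=4, replace=False)     # within-group sparsity
    x[support] = rng.standard_normal(4)

sigma = 0.05
y = D @ x + sigma * rng.standard_normal(d)               # y = Dx + eps
```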

The proposed method for hierarchical sparse coding

The current approaches for hierarchical sparse coding generally achieve between-group sparsity by imposing the $\ell_1$-norm at the group level, so that many groups have small coefficients rather than being discarded. To this end, we draw on the respective merits of the two priors recalled above to reformulate hierarchical sparse coding from a novel Bayesian perspective.

Experiments

In this section, we conduct experiments to evaluate the performance of the proposed method. First, we recover 1D simulated signals from their noisy observations to investigate the capability of discovering the correct sparse pattern. Then, we apply OEL to image inpainting to observe its denoising capability and the compactness of its representation. Finally, we probe the discriminative power of the sparse representation at a high confidence level. The following related methods are used for comparison: OMP [16]

Conclusion and future work

In this paper, we reformulate hierarchical sparse coding within the Bayesian framework, where we develop a nested prior by integrating two common sparsity-inducing priors: the Laplace prior and the spike-and-slab prior. The resulting objective stipulates between-group sparsity more explicitly than popular approaches. We then present a simple algorithm for pursuing its convergent solution. The experimental results on signal recovery and image recovery show that the

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their constructive suggestions and helpful comments.

References (51)

  • Y. Zhang et al., Low-rank preserving embedding, Pattern Recognit. (2017)
  • T. Hastie et al., Statistical Learning with Sparsity: The Lasso and Generalizations (2015)
  • C. Hou et al., Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern. (2014)
  • A.M. Bruckstein et al., From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. (2009)
  • M. Aharon et al., K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. (2006)
  • A. Zaslaver et al., Hierarchical sparse coding in the sensory system of Caenorhabditis elegans, Proc. Natl. Acad. Sci. (2015)
  • J. Wright et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • E. Elhamifar et al., Sparse subspace clustering: algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • J. Mairal et al., Sparse representation for color image restoration, IEEE Trans. Image Process. (2008)
  • R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996)
  • J. Mairal, F. Bach, J. Ponce, Sparse modeling for image and vision processing, arXiv preprint arXiv:1411.3230
  • B. Adcock et al., Breaking the coherence barrier: a new theory for compressed sensing, Forum Math. Sigma (2017)
  • J. Tropp et al., Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory (2007)
  • Y.C. Eldar et al., Block-sparse signals: uncertainty relations and efficient recovery, IEEE Trans. Signal Process. (2010)
  • J. Huang et al., Learning with structured sparsity, J. Mach. Learn. Res. (2011)

Yupei Zhang received the B.Eng. degree in computer science and technology from East China University of Technology in 2009 and the M.Eng. degree in computer software and theory from Zhengzhou University in 2013. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University. His current research interests mainly include sparse representation, pattern recognition and machine learning.

Ming Xiang received the B.Eng. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, China, in 1987 and 1999 respectively, and currently works as an associate professor in the department of computer science and technology in Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include information fusion, pattern recognition and machine learning.

Bo Yang received the B.Eng. degree in computer science and technology from Xi'an University of Posts & Telecommunication, Xi'an, China, in 2005, and received the M.Eng. degree in computer system architecture from Xidian University, Xi'an, China, in 2009. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include manifold learning, pattern recognition and machine learning.
