Hierarchical sparse coding from a Bayesian perspective
Introduction
Parsimony, preferring a simple explanation over a complex one, is one of the most important principles for guiding data modeling. In the context of statistics and machine learning, it takes the form of compact representations and economical models of data [1]. A compact representation can uncover the inherent mechanism of data generation and is thus conducive to subsequent data analysis tasks, as in principal component analysis (PCA) [2] and the fast correlation-based filter (FCBF) [3]. An economical model, selected during modeling, can effectively avoid overfitting and thus yields better generalization performance, as in joint embedding learning and sparse regression [4]. Achieving parsimonious representations of data and models has become a primary task in most scientific domains, especially in studies involving high-dimensional observations.
In pursuit of parsimony, sparse modeling asserts that high-dimensional data have many coefficients equal to zero when linearly represented by a combination of atoms from a given or learned dictionary [5], [6]. The case of a fixed dictionary is usually referred to as sparse coding, which we consider in this paper. As the kernel of sparse modeling, sparse coding not only delivers a concise representation for data explanation [7] and analysis [8], [9], [10], but also yields a succinct linear model for data reconstruction [11] and prediction [12]. Sparse coding has therefore attracted an enormous amount of study in recent years and has achieved many appealing results in computer vision [13], pattern analysis [8], compressed sensing [14] and image retrieval [15].
With the sparsity assertion, sparse coding aims at selecting as few atoms as possible from a given overcomplete dictionary to linearly reconstruct the probe data, while keeping the reconstruction error as small as possible. Concretely, let y ∈ ℝ^d be a probe datum and D = [d₁, d₂, …, d_m] ∈ ℝ^{d×m} be the dictionary composed of m atoms d_i. Sparse coding seeks the sparse representation x ∈ ℝ^m of y over D via the following primitive formulation:

min_x ‖x‖₀  s.t.  ‖y − Dx‖₂ ≤ ε,    (1)

where ‖·‖₀ denotes the ℓ0-norm, which counts the number of nonzero entries in a vector. Solving (1) is NP-hard; however, a sub-optimal solution can be obtained by a greedy scheme, such as orthogonal matching pursuit (OMP) [16]. An alternative route is to relax (1) into its closest convex proxy via the ℓ1-norm [17], [18], whose Lagrangian formulation can be cast as:

min_x ½‖y − Dx‖₂² + λ‖x‖₁,    (2)

where the parameter λ balances the reconstruction error against the sparsity. This convex formulation is often referred to as the Lasso [12] or basis pursuit [19]. Numerous algorithms have been put forth to pursue the optimal solution to (2), such as least angle regression (LARS) [20]; see [13], [21] and references therein.
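The greedy route to (1) can be made concrete with a minimal numpy sketch of OMP (an illustrative implementation, not the one used in this paper): at each step, the atom most correlated with the current residual is added to the support, and the coefficients on the support are re-fitted by least squares.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: select at most k atoms of D
    to approximate y, i.e., a sub-optimal solution to the l0 problem (1)."""
    d, m = D.shape
    residual = y.copy()
    support = []
    x = np.zeros(m)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit the coefficients on the selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(m)
        x[support] = coef
        residual = y - D @ x
    return x

# Noiseless sanity check: a 3-sparse code over a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 200))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
x_true = np.zeros(200)
x_true[[5, 40, 99]] = [1.5, -2.0, 0.8]
y = D @ x_true
x_hat = omp(D, y, k=3)
```

In the noiseless, incoherent setting above, the greedy selection typically recovers the true support exactly, which is the classical average-case behavior of OMP [16].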
In many practical situations, one knows a group structure on the coefficient vector x in addition to sparsity. In this case, the nonzero entries of x are no longer unrelated but appear in groups, i.e., groups of atoms should be selected or ignored simultaneously. For example, in genomics, factors of gene expression patterns are expected to involve groups of genes corresponding to a biological pathway, or sets of genes that are neighbors in the protein-protein interaction network. Taking this higher-order prior knowledge into account benefits sparse coding in both interpretability [13] and predictive performance [22]. Elastic Net penalizes both the ℓ1-norm and the ℓ2-norm to promote such a grouping effect [23], while Group Lasso imposes the ℓ1/ℓ2-norm on groups to directly encourage between-group sparsity [24]. Denoting by x = [x₁ᵀ, x₂ᵀ, …, x_Jᵀ]ᵀ the coefficient vector partitioned into J groups and by D = [D₁, D₂, …, D_J] the correspondingly partitioned dictionary, Group Lasso can be cast as:

min_x ½‖y − Dx‖₂² + λ Σ_{j=1}^{J} √p_j ‖x_j‖₂,    (3)

where p_j is the size of the group x_j and the second term is just the mixed ℓ1/ℓ2-norm. Group Lasso can be tackled by convex algorithms [21] and has been extended to the case of overlapping groups [25]. Its greedy version [17] can be written as:

min_x ‖y − Dx‖₂²  s.t.  Σ_{j=1}^{J} I(‖x_j‖₂ > 0) ≤ k,    (4)

where I(·) denotes an indicator function and k bounds the number of selected groups. An approximate solution to (4) can be delivered by the block orthogonal matching pursuit (BOMP) algorithm [17]. Similar methods, structured orthogonal matching pursuit (StructOMP) [18] and group orthogonal matching pursuit (GroupOMP) [26], have also been investigated from different theoretical perspectives and in different applications.
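The group-wise greedy scheme behind (4) can be sketched in the same style (a simplified illustration of BOMP-style selection, not the reference implementation of [17]): groups compete by the energy of their correlation with the residual, and a whole group of atoms is selected at once.

```python
import numpy as np

def block_omp(D, y, groups, k_groups):
    """Greedy group selection in the spirit of BOMP: at each step, pick
    the group of atoms most correlated with the residual, then refit all
    selected groups jointly by least squares.
    `groups` is a list of index arrays partitioning the columns of D."""
    m = D.shape[1]
    residual = y.copy()
    selected = []
    x = np.zeros(m)
    for _ in range(k_groups):
        # Group score: energy of the group's correlation with the residual.
        scores = [np.linalg.norm(D[:, g].T @ residual) for g in groups]
        selected.append(int(np.argmax(scores)))
        idx = np.concatenate([groups[j] for j in selected])
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        x = np.zeros(m)
        x[idx] = coef
        residual = y - D @ x
    return x, sorted(set(selected))

# Toy check: a signal built from groups 2 and 7 of a partitioned dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 40))
D /= np.linalg.norm(D, axis=0)
groups = [np.arange(4 * j, 4 * (j + 1)) for j in range(10)]
x_true = np.zeros(40)
x_true[groups[2]] = [1.0, -1.0, 0.5, 2.0]
x_true[groups[7]] = [0.5, 1.5, -2.0, 1.0]
y = D @ x_true
x_hat, sel = block_omp(D, y, groups, k_groups=2)
```

Note that, exactly as discussed below, the coefficients inside each selected group come out dense: the least-squares refit fills the whole group.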
However, group sparse coding yields dense coefficients within the selected groups, where a sparse effect is often expected for further interpretation. For example, in genomics, we would like sparsity within each group in addition to sparsity between groups, so that we can identify particularly important genes in the biological pathways of interest. Adding such hierarchical sparse prior knowledge can give rise to both a more robust representation and a more convincing data interpretation than group sparse coding [7]. Towards this end, Sparse-Group Lasso uses a convex combination of the Lasso penalty and the Group-Lasso penalty to obtain hierarchical sparsity at both the group level and the atom level [27], which is cast as:

min_x ½‖y − Dx‖₂² + (1 − α)λ Σ_{j=1}^{J} √p_j ‖x_j‖₂ + αλ‖x‖₁,    (5)

where α ∈ [0, 1] constructs a convex combination of penalties: α = 0 gives the Group-Lasso fit, and α = 1 gives the Lasso fit. A similar method, named Hierarchical Lasso (HiLasso) [28], also uses the ℓ1/ℓ2-norm and the ℓ1-norm for hierarchical sparsity, but without the convex combination. Introducing two independent balance parameters λ₁ and λ₂, HiLasso aims to solve:

min_x ½‖y − Dx‖₂² + λ₂ Σ_{j=1}^{J} ‖x_j‖₂ + λ₁‖x‖₁.    (6)
Actually, when the groups are of equal size, Sparse-Group Lasso becomes a specific case of HiLasso with the constrained parameter selection shown in formula (5). Since both (5) and (6) are convex, various convex algorithms can be exploited [21]. Note that the formulation for hierarchical sparsity is more general, since it can reduce to the Lasso (2) and the Group Lasso (3). Hence, hierarchical sparse models are capable of dealing with various application scenarios and have attracted a lot of attention. The combination of the squared mixed ℓ1/ℓ2-norm and the ℓ0-norm is developed in [29]. In [30], ℓ0+ℓ1+ℓ2 regularization is employed for hierarchical sparsity. Besides, a more complex regularization is presented in [31].
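For intuition on how a convex formulation like (5) is optimized, here is a small proximal-gradient (ISTA) sketch in numpy. It relies on the known closed form of the Sparse-Group Lasso proximal operator (elementwise soft-thresholding followed by group soft-thresholding); it is an illustration under these assumptions, not the solver used in the cited works.

```python
import numpy as np

def sgl_prox(v, groups, lam, alpha, step):
    """Proximal operator of the Sparse-Group Lasso penalty in (5):
    l1 soft-thresholding within each group, then group-level shrinkage
    that can zero out a whole group at once."""
    out = np.zeros_like(v)
    for g in groups:
        # l1 shrinkage (atom-level sparsity).
        u = np.sign(v[g]) * np.maximum(np.abs(v[g]) - step * alpha * lam, 0.0)
        # Group shrinkage (between-group sparsity), with the sqrt(p_j) weight.
        norm = np.linalg.norm(u)
        thr = step * (1 - alpha) * lam * np.sqrt(len(g))
        if norm > thr:
            out[g] = (1 - thr / norm) * u
    return out

def sgl_ista(D, y, groups, lam, alpha, n_iter=500):
    """Plain proximal-gradient (ISTA) iteration for objective (5)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = sgl_prox(x - grad / L, groups, lam, alpha, 1.0 / L)
    return x

# Toy demo: y generated from two active groups of a 6-group dictionary;
# the prox step tends to zero out the inactive groups entirely.
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 30))
D /= np.linalg.norm(D, axis=0)
groups = [np.arange(5 * j, 5 * (j + 1)) for j in range(6)]
x_true = np.zeros(30)
x_true[groups[0]] = [1.0, 0.0, 2.0, 0.0, 1.5]   # sparse within the group
x_true[groups[3]] = [0.0, 1.0, 0.0, -1.0, 0.0]
y = D @ x_true
x_hat = sgl_ista(D, y, groups, lam=0.05, alpha=0.5)
```

The composition of the two shrinkages is what produces sparsity at both levels; with α = 0 or α = 1 the update reduces to a pure Group-Lasso or pure Lasso step, mirroring the limiting cases of (5).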
Hierarchical sparsity has thus been achieved; nevertheless, the current methods mostly obtain between-group sparsity by imposing the ℓ1-norm on the groups. For an intuitive motivation, we sparsely reconstruct the corrupted image from the USPS dataset shown in Fig. 1(a). We randomly select 100 images per digit as the dictionary, over which the representation is supposed to be hierarchically sparse. That is, the corrupted image is expected to select, for its reconstruction, images of similar chirography belonging to the same digit [8]. Sparse-Group Lasso, HiLasso and our method give rise to the sparse coefficients shown in Fig. 1. From Fig. 1(b, c), we find that Sparse-Group Lasso and HiLasso leave two groups with small coefficients rather than zeroing them out completely. There are at least two reasons for this phenomenon. The first is data noise: noise likely breaks the original trade-off between the reconstruction error and the sparsity, so extra small coefficients appear in seeking a new compromise. The second is the rough penalization of the ℓ1-norm: assigning an identical penalty to all groups under-penalizes the groups that should be ignored and over-penalizes the groups with truly large coefficients. As a result, the obtained representation is unclean and harmful to interpretability. Several studies have addressed these difficulties, such as the reports [32], [33] for the first reason and the adaptive Lasso [34], [35] for the second. In addition, the method in [29] mainly replaces the ℓ1-norm of HiLasso with the ℓ0-norm and fails to consider the two drawbacks; the route in [30] fails to explicitly take the group structure into account; and the approach in [31], like Sparse-Group Lasso, only considers the second shortcoming. Thus, effective methods tackling both of them are still lacking, especially for hierarchical sparse coding.
To this end, in this paper we consider the problem of hierarchical sparse coding within the Bayesian framework, which usually yields a more explicable formulation [36]. In this framework, we construct a nested prior for hierarchical sparsity by employing the spike-and-slab prior and the Laplace prior together. Introducing the former at the group level, resulting in the ℓ0-norm, aims to achieve between-group sparsity; imposing the latter on each atom within a group, resulting in the ℓ1-norm, is expected to yield within-group sparsity as well as a small reconstruction error. Our approach can deliver a cleaner sparse representation, as in the example shown in Fig. 1(d). The main contributions of our paper can be outlined as follows:
(1) We provide a Bayesian interpretation for hierarchical sparse coding utilizing the proposed nested prior. To the best of our knowledge, this is the first time that such a prior, integrating the spike-and-slab prior and the Laplace prior, has been developed for sparse coding.
(2) The resultant mathematical formulation is general, because it can reduce to the BOMP and the Lasso in specific situations. Different from Sparse-Group Lasso and HiLasso, it explicitly stipulates between-group sparsity by the ℓ0-norm rather than the ℓ1-norm.
(3) Inspired by the proposed nested prior, we devise a simple algorithm combining the OMP algorithm and the LARS algorithm to obtain a convergent solution to the resultant optimization problem in a few iterations. Besides, we offer some analyses of the algorithm.
(4) In the proposed algorithm, we take the two drawbacks above into account. Concretely, we perform greedy group selection until a suspected group that overfits the noise is encountered. On the other hand, sparse coding is limited to the selected groups, such that the maximal penalization is imposed on the ignored groups.
The remainder of the paper is organized as follows. In Section 2, we briefly review the use of sparsity-inducing priors, including the spike-and-slab prior and the Laplace prior. In Section 3, we propose the nested prior, present our method and the solving algorithm, and analyze the algorithm. The experimental results are shown in Section 4, followed by some discussions. We finally conclude the paper in Section 5.
Sparse coding with the Bayesian framework
To set the stage for introducing our Bayesian formulation, in this section, we review the common basic assumption of sparse linear model and the two extensively used sparsity-inducing priors: the spike-and-slab prior and the Laplace prior.
Usually, we consider the sparse linear representation model [37] as follows:

y = Dx + ɛ,

where it is common to assume that the noise follows an i.i.d. Gaussian distribution with zero mean and variance σ², and thus ɛ ∼ Normal(0, σ²I). Note that in the context of d > …
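The Gaussian-noise assumption makes the Bayesian reading of the Lasso (2) concrete: with an i.i.d. Laplace prior p(x_i) ∝ exp(−|x_i|/b), the negative log-posterior of x equals, up to additive constants and a σ² scaling, the Lasso objective with λ = σ²/b. A small numeric check (the dimensions and parameter values here are arbitrary):

```python
import numpy as np

# Negative log-posterior under Gaussian likelihood + Laplace prior vs. the
# Lasso objective (2) with lambda = sigma^2 / b: identical up to constants.
rng = np.random.default_rng(1)
D = rng.standard_normal((20, 50))
y = rng.standard_normal(20)
x = rng.standard_normal(50)
sigma2, b = 0.5, 2.0

neg_log_post = (0.5 / sigma2) * np.sum((y - D @ x) ** 2) + np.sum(np.abs(x)) / b
lam = sigma2 / b
lasso_obj = 0.5 * np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x))

# Scaling by sigma^2 makes the two objectives coincide, so they share minimizers.
assert np.isclose(sigma2 * neg_log_post, lasso_obj)
```

The same bookkeeping with a spike-and-slab prior introduces the ℓ0-type counting term, which is the route taken for the group level in this paper.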
The proposed method for hierarchical sparse coding
The current approaches for hierarchical sparse coding typically achieve between-group sparsity by imposing the ℓ1-norm at the group level, such that many groups end up with small coefficients rather than being discarded. To this end, we lean on the respective merits of the two priors recalled above to reformulate hierarchical sparse coding from a novel Bayesian perspective.
Experiments
In this section, we conduct experiments to evaluate the performance of the proposed method. First, we recover 1D simulated signals from their noisy observations to investigate the capability of discovering the correct sparse pattern. Then, we apply OEL to image inpainting to observe the denoising capability and the compactness of the representation. Finally, we probe the discriminative power of the sparse representation at a high confidence level. The related methods used for comparison include OMP [16] …
Conclusion and future work
In this paper, we reformulate hierarchical sparse coding within the Bayesian framework, where we develop a nested prior by integrating two common sparsity-inducing priors: the Laplace prior and the spike-and-slab prior. The resulting objective stipulates between-group sparsity more explicitly than the popular approaches. We then present a simple algorithm for pursuing its convergent solution. The experimental results, on signal recovery and image recovery, show that the …
Acknowledgments
The authors would like to thank the editors and any anonymous reviewers for their constructive suggestions and helpful comments.
Yupei Zhang received the B.Eng. degree in computer science and technology from East China University of Technology in 2009 and the M.Eng. degree in computer software and theory from Zhengzhou University in 2013. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University. His current research interests mainly include sparse representation, pattern recognition and machine learning.
References

- et al., Joint sparse principal component analysis, Pattern Recognit. (2017)
- et al., A fault diagnosis approach for diesel engines based on self-adaptive WVD, improved FCBF and PECOC-RVM, Neurocomputing (2016)
- et al., Linear dimensionality reduction based on Hybrid structure preserving projections, Neurocomputing (2016)
- et al., Graph regularized nonnegative sparse coding using incoherent dictionary for approximate nearest neighbor search, Pattern Recognit. (2017)
- et al., Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vis. Res. (1997)
- et al., Sparse coding from a Bayesian perspective, IEEE Trans. Neural Netw. Learn. Syst. (2013)
- et al., Sparse coding for image denoising using spike and slab prior, Neurocomputing (2013)
- et al., Efficient classification with sparsity augmented collaborative representation, Pattern Recognit. (2017)
- et al., Machine learning classification with confidence: application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression, Neuroimage (2011)
- et al., Structured regularized robust coding for face recognition, Neurocomputing (2016)
- Low-rank preserving embedding, Pattern Recognit.
- Statistical Learning with Sparsity: The Lasso and Generalizations
- Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern.
- From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev.
- K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process.
- Hierarchical sparse coding in the sensory system of Caenorhabditis elegans, Proc. Natl. Acad. Sci.
- Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell.
- Sparse subspace clustering: algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell.
- Sparse representation for color image restoration, IEEE Trans. Image Process.
- Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.)
- Breaking the coherence barrier: a new theory for compressed sensing, Forum Math. Sigma
- Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory
- Block-sparse signals: uncertainty relations and efficient recovery, IEEE Trans. Signal Process.
- Learning with structured sparsity, J. Mach. Learn. Res.
Ming Xiang received the B.Eng. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, China, in 1987 and 1999 respectively, and currently works as an associate professor in the department of computer science and technology in Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include information fusion, pattern recognition and machine learning.
Bo Yang received the B.Eng. degree in computer science and technology from Xi'an University of Posts & Telecommunication, Xi'an, China, in 2005, and received the M.Eng. degree in computer system architecture from Xidian University, Xi'an, China, in 2009. He is currently a Ph.D. candidate in the department of computer science and technology, Xi'an Jiaotong University, Xi'an, China. His current research interests mainly include manifold learning, pattern recognition and machine learning.