MAP approximation to the variational Bayes Gaussian mixture model and application

Lim, Kart-Leong; Wang, Han

doi:10.1007/s00500-017-2565-z

Methodologies and Application
Published: 04 April 2017

Soft Computing, Volume 22, pages 3287–3299 (2018)

Abstract

Learning in variational inference is widely viewed as first estimating the class assignment variable and then using it to estimate the parameters of the mixture model. The estimate is mainly performed by computing the expectations of the prior models. However, learning is not exclusive to expectation. Several authors report other possible configurations that use different combinations of maximization or expectation for the estimation. For instance, variational inference is generalized under the expectation–expectation (EE) algorithm. Inspired by this, another variant known as the maximization–maximization (MM) algorithm has recently been applied to various models such as the Gaussian mixture, the Field-of-Gaussians mixture, and the sparse-coding-based Fisher vector. Despite this recent success, MM is not without issues. Firstly, theoretical studies comparing MM to EE are very rare. Secondly, the computational efficiency and accuracy of MM are seldom compared with those of EE. Hence, it is difficult to justify using MM over a mainstream learner such as EE or even Gibbs sampling. In this work, we revisit the learning of EE and MM on a simple Bayesian GMM case. We also make a theoretical comparison of MM with EE and find that they in fact obtain near-identical solutions. In the experiments, we performed unsupervised classification, comparing the computational efficiency and accuracy of MM and EE on two datasets. We also performed unsupervised feature learning, comparing a Bayesian approach such as MM with other maximum likelihood approaches on two datasets.

Acknowledgements

We are grateful to Dr Shiping Wang for his helpful discussion and guidance.

Author information

Authors and Affiliations

School of Electrical and Electronics Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
Kart-Leong Lim & Han Wang

Corresponding author

Correspondence to Kart-Leong Lim.

Ethics declarations

Conflict of interest

Author Kart-Leong Lim declares no conflict of interest. Co-author Han Wang declares no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Appendix

In our proposed approach for solving the Bayesian GMM using the MM algorithm, we emphasize its computational efficiency. For completeness, we now discuss how convergence can be assessed by checking the lower bound at each iteration.

$$\begin{aligned} \mathcal {L}\ge & {} {\int }{\int }{\int }{\int } q(z,\mu ,\tau ,\pi )\ln \left[ \frac{p(x,z,\mu ,\tau ,\pi )}{q(z,\mu ,\tau ,\pi )}\right] {\hbox {d}}z\,{\hbox {d}}\mu \,{\hbox {d}}\tau \,{\hbox {d}}\pi \nonumber \\= & {} E\left[ \ln p(x,z,\mu ,\tau ,\pi )\right] -E\left[ \ln q(z,\mu ,\tau ,\pi )\right] \nonumber \\= & {} E\left[ \ln p(x\mid z,\mu ,\tau )\right] +E\left[ \ln p(z\mid \pi )\right] +E\left[ \ln p(\pi )\right] \nonumber \\&+\,E\left[ \ln p(\mu ,\tau )\right] -E\left[ \ln q(z)\right] \nonumber \\&-\,E\left[ \ln q(\pi )\right] -E\left[ \ln q(\mu ,\tau )\right] \end{aligned}$$

(52)
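
The step from the second to the third line of (52) relies on two factorizations that the expansion itself implies. They are restated here, as a reading aid rather than as part of the original derivation, using the same symbols:

$$\begin{aligned} p(x,z,\mu ,\tau ,\pi )&= p(x\mid z,\mu ,\tau )\,p(z\mid \pi )\,p(\pi )\,p(\mu ,\tau ) \\ q(z,\mu ,\tau ,\pi )&= q(z)\,q(\pi )\,q(\mu ,\tau ) \end{aligned}$$

Taking logarithms turns each product into a sum, and linearity of expectation then yields the four model terms and the three entropy-like terms on the last lines of (52).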

In the lower bound expression above, each expectation is taken with respect to all of the hidden variables and is simplified using Jensen’s inequality as follows:

$$\begin{aligned}&E\left[ \ln p(x\mid z,\mu ,\tau )\right] \nonumber \\&\quad =E\left[ \mathop {\sum }\limits _{k=1}^{K}\mathop {\sum }\limits _{n=1}^{N}\ln \mathcal {N} \left( x_{n}\mid \mu _{k},\tau _{k}^{-1}\right) ^{z_{nk}}\right] \nonumber \\&\quad = \mathop {\sum }\limits _{k=1}^{K} \mathop {\sum }\limits _{n=1}^{N}\left\{ \frac{1}{2} E\left[ \ln \tau _{k}\right] -\frac{E\left[ \tau _{k}\right] }{2} (x_{n}{-}E\left[ \mu _{k}\right] )^{2}\right\} E\left[ z_{nk}\right] \end{aligned}$$

(53)

$$\begin{aligned} E\left[ \ln p(z\mid \pi )\right]= & {} E\left[ \mathop {\sum }\limits _{k=1}^{K} \mathop {\sum }\limits _{n=1}^{N}z_{nk}\ln \pi _{k}\right] \nonumber \\= & {} \mathop {\sum }\limits _{k=1}^{K}\mathop {\sum }\limits _{n=1}^{N}E\left[ z_{nk}\right] E\left[ \ln \pi _{k}\right] \end{aligned}$$

(54)

$$\begin{aligned} E\left[ \ln p(\pi )\right]= & {} E\left[ \mathop {\sum }\limits _{k=1}^{K}\ln {\hbox {Dir}}\left( \pi _{k}\mid \alpha _{0}\right) \right] \nonumber \\= & {} (\alpha _{0}-1) \mathop {\sum }\limits _{k=1}^{K}E\left[ \ln \pi _{k}\right] \end{aligned}$$

(55)

$$\begin{aligned}&E\left[ \ln p(\mu ,\tau )\right] \nonumber \\&\quad =E\left[ \mathop {\sum }\limits _{k=1}^{K}\ln \left\{ \mathcal {N}\left( \mu _{k}\mid m_{0},(\lambda _{0}\tau _{k})^{-1}\right) {\hbox {Gam}}\left( \tau _{k}\mid a_{0},b_{0}\right) \right\} \right] \nonumber \\&\quad =\mathop {\sum }\limits _{k=1}^{K}\left( -\frac{(E\left[ \mu _{k}\right] - m_{0})^{2}}{2\left( \lambda _{0}E\left[ \tau _{k}\right] \right) ^{-1}} +(a_{0}-1)E\left[ \ln \tau _{k}\right] -b_{0}E\left[ \tau _{k}\right] \right) \end{aligned}$$

(56)

$$\begin{aligned} E\left[ \ln q(z)\right]= & {} \mathop {\sum }\limits _{n=1}^{N}\mathop {\sum }\limits _{k=1}^{K}\left\{ \ln E\left[ \pi _{k}\right] +\frac{1}{2}\ln E\left[ \tau _{k}\right] \right. \nonumber \\&\quad -\frac{E\left[ \tau _{k}\right] }{2}(x_{n}-E\left[ \mu _{k}\right] )^{2}\Biggr \} E\left[ z_{nk}\right] \end{aligned}$$

(57)

$$\begin{aligned} E\left[ \ln q(\pi )\right]= & {} E\left[ \mathop {\sum }\limits _{n=1}^{N} \mathop {\sum }\limits _{k=1}^{K}E\left[ z_{nk}\right] \ln \pi _{k}\right. \nonumber \\&\left. +(\alpha _{0}-1)\mathop {\sum }\limits _{k=1}^{K}\ln \pi _{k}\right] \nonumber \\= & {} \mathop {\sum }\limits _{n=1}^{N}\mathop {\sum }\limits _{k=1}^{K}E\left[ z_{nk}\right] \ln E\left[ \pi _{k}\right] \nonumber \\&+(\alpha _{0}-1)\mathop {\sum }\limits _{k=1}^{K}\ln E\left[ \pi _{k}\right] \end{aligned}$$

(58)

$$\begin{aligned}&E\left[ \ln q(\mu ,\tau )\right] \nonumber \\&\quad =E\left[ \mathop {\sum }\limits _{k=1}^{K}\left( \mathop {\sum }\limits _{n=1}^{N}\left\{ \frac{\ln \tau _{k}}{2}-\frac{\tau _{k}}{2}(x_{n}-E\left[ \mu _{k}\right] )^{2}\right\} E\left[ z_{nk}\right] \right. \right. \nonumber \\&\qquad \left. \left. -\frac{(E\left[ \mu _{k}\right] -m_{0})^{2}}{2(\lambda _{0}\tau _{k})^{-1}}+(a_{0}-1)\ln \tau _{k}-b_{0}\tau _{k} \right) \right] \nonumber \\&\quad =\mathop {\sum }\limits _{k=1}^{K}\left( \mathop {\sum }\limits _{n=1}^{N}\left\{ \frac{E\left[ \ln \tau _{k}\right] }{2}-\frac{E\left[ \tau _{k}\right] }{2}(x_{n}-E\left[ \mu _{k}\right] )^{2}\right\} E\left[ z_{nk}\right] \right. \nonumber \\&\qquad \left. -\frac{(E\left[ \mu _{k}\right] -m_{0})^{2}}{2(\lambda _{0}E\left[ \tau _{k}\right] )^{-1}}+(a_{0}-1)E\left[ \ln \tau _{k}\right] -b_{0}E\left[ \tau _{k}\right] \right) \end{aligned}$$

(59)
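
As a minimal sketch of how the terms in (53)–(59) combine into (52), the following Python function evaluates each term exactly as written above for a one-dimensional mixture. It assumes the required expectations have already been produced by the learner. The function name, the argument names, the dictionary keys and the use of NumPy are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np


def lower_bound(x, E_z, E_mu, E_tau, E_ln_tau, E_pi, E_ln_pi,
                alpha0, m0, lambda0, a0, b0):
    """Evaluate Eq. (52) from the terms in Eqs. (53)-(59).

    Assumed shapes: x is (N,), E_z is (N, K), every other array is (K,).
    Returns the bound and a dict holding the individual terms.
    """
    # (x_n - E[mu_k])^2 for every sample/cluster pair, shape (N, K)
    sq = (x[:, None] - E_mu[None, :]) ** 2
    # (E[mu_k] - m0)^2 / (2 (lambda0 E[tau_k])^-1), shape (K,)
    prior_quad = 0.5 * lambda0 * E_tau * (E_mu - m0) ** 2

    # Data-fit sum that appears in both Eq. (53) and Eq. (59)
    fit = np.sum((0.5 * E_ln_tau - 0.5 * E_tau * sq) * E_z)
    # Prior sum that appears in both Eq. (56) and Eq. (59)
    prior = np.sum(-prior_quad + (a0 - 1.0) * E_ln_tau - b0 * E_tau)

    terms = {
        "E_ln_p_x": fit,                                   # Eq. (53)
        "E_ln_p_z": np.sum(E_z * E_ln_pi),                 # Eq. (54)
        "E_ln_p_pi": (alpha0 - 1.0) * np.sum(E_ln_pi),     # Eq. (55)
        "E_ln_p_mu_tau": prior,                            # Eq. (56)
        "E_ln_q_z": np.sum(                                # Eq. (57)
            (np.log(E_pi) + 0.5 * np.log(E_tau) - 0.5 * E_tau * sq) * E_z
        ),
        "E_ln_q_pi": (                                     # Eq. (58)
            np.sum(E_z * np.log(E_pi))
            + (alpha0 - 1.0) * np.sum(np.log(E_pi))
        ),
        "E_ln_q_mu_tau": fit + prior,                      # Eq. (59)
    }

    # Eq. (52): model terms minus the terms involving q
    bound = (
        terms["E_ln_p_x"] + terms["E_ln_p_z"]
        + terms["E_ln_p_pi"] + terms["E_ln_p_mu_tau"]
        - terms["E_ln_q_z"] - terms["E_ln_q_pi"] - terms["E_ln_q_mu_tau"]
    )
    return bound, terms
```

Calling the function once per iteration and comparing successive values of `bound` gives the convergence check described at the start of this appendix.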

In practice, it is not necessary to compute the lower bound, since convergence can be visualized by tracking the change in the weight of each cluster using the equation below, as seen in the experiments.

$$\begin{aligned} \pi _{k}=\frac{{\sum }_{n=1}^{N}E\left[ z_{nk}\right] }{{\sum }_{j=1}^{K}{\sum }_{n=1}^{N}E\left[ z_{nj}\right] } \end{aligned}$$

(60)
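
The weight update in (60) and a simple check on its change between iterations can be sketched as follows. The function names, the tolerance value and the maximum-absolute-change stopping rule are assumptions made for illustration; the text above only states that the change in the weights is inspected.

```python
import numpy as np


def cluster_weights(E_z):
    """Eq. (60): pi_k = sum_n E[z_nk] / sum_j sum_n E[z_nj].

    E_z is an (N, K) array of expected assignments.
    """
    counts = E_z.sum(axis=0)       # sum over samples for each cluster
    return counts / counts.sum()   # normalize over all clusters


def weights_converged(prev_pi, new_pi, tol=1e-4):
    """Hypothetical stopping rule: largest weight change is below tol."""
    return bool(np.max(np.abs(new_pi - prev_pi)) < tol)


# Small illustrative input: four samples, two clusters
E_z_example = np.array([[1.0, 0.0],
                        [0.5, 0.5],
                        [0.0, 1.0],
                        [1.0, 0.0]])
pi_example = cluster_weights(E_z_example)   # array([0.625, 0.375])
```

Plotting the value returned by `cluster_weights` against the iteration index gives the kind of convergence curve described above without evaluating the lower bound.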

About this article

Cite this article

Lim, KL., Wang, H. MAP approximation to the variational Bayes Gaussian mixture model and application. Soft Comput 22, 3287–3299 (2018). https://doi.org/10.1007/s00500-017-2565-z

Published: 04 April 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s00500-017-2565-z
