Abstract
We develop flexible methods for deriving variational inference in models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models there are global variance parameters that determine the scale of the local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations that exploit conditional independence structure, as well as improved estimation of the log marginal likelihood and the variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. The methodology is illustrated with applications to generalized linear mixed models and state space models.
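For concreteness, the following is a minimal, non-authoritative sketch of the importance-weighted estimate of the log marginal likelihood mentioned above, \(\log p(y) \approx \log\{K^{-1}\sum_k p(y,\theta_k)/q_\lambda(\theta_k)\}\) with \(\theta_k\) drawn from the variational density. The callables log_joint, sample_q and log_q are hypothetical placeholders, not functions supplied by the paper.

```python
# Minimal sketch (not the authors' code) of the importance-weighted estimate of
# the log marginal likelihood: log p(y) ~ logsumexp(log w_k) - log K, where
# log w_k = log p(y, theta_k) - log q(theta_k) and theta_k ~ q_lambda.
import numpy as np
from scipy.special import logsumexp

def iw_log_marginal_estimate(log_joint, sample_q, log_q, K=100, seed=None):
    rng = np.random.default_rng(seed)
    log_w = np.empty(K)
    for k in range(K):
        theta = sample_q(rng)                        # draw theta_k ~ q_lambda
        log_w[k] = log_joint(theta) - log_q(theta)   # log importance weight
    # log-sum-exp for numerical stability; larger K tightens the bound
    return logsumexp(log_w) - np.log(K)
```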
Acknowledgements
We thank the editor and reviewer for their time and constructive comments on this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Linda Tan and Aishwarya Bhaskaran are supported by the start-up Grant R-155-000-190-133.
Appendices
Appendix A: Derivation of stochastic gradient
Let \(\otimes \) denote the Kronecker product of two matrices. We have
where \(v(C_2^*) = f + F(\mu _1 + C_1^{-T} s_1)\). Differentiating \(r_\lambda (s)\) with respect to \(\lambda \), \(\nabla _\lambda r_\lambda (s) \) is given by
Since \(\theta _G\) does not depend on d, D, f and F, we have
It is easy to see that \(\nabla _{\mu _1} \theta _G = I_G\) and \(\nabla _d \theta _L = I_{nL}\). The rest of the terms are derived as follows.
Differentiating \(\theta _G\) with respect to \(v(C_1^*)\),
Differentiating \(\theta _L\) with respect to f,
Differentiating \(\theta _L\) with respect to F,
Differentiating \(\theta _L\) with respect to D,
Differentiating \(\theta _L\) with respect to \(\mu _1\),
Differentiating \(\theta _L\) with respect to \(v(C_1)\),
Since \(s_1 = C_1^T(\theta _G - \mu _1)\) and \(s_2 = C_2^T (\theta _L - \mu _2)\), we have
As \(\mu _2 = d + C_2^{-T} D(\mu _1 - \theta _G)\) and \(v(C_2^*) = f + F \theta _G\), differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _G\),
Therefore
Note that \(D_2^* v(C_2^{-T}) = v(I_{nL})\) as \(C_2^{-T}\) is upper triangular and \(v(C_2^{-T})\) only retains the diagonal elements of \(C_2^{-T}\).
Differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _L\),
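As a reading aid, here is a minimal sketch (not the authors' code) of generating a draw \(\theta = (\theta_G, \theta_L)\) from the variational approximation using the relations above: \(\theta_G = \mu_1 + C_1^{-T} s_1\), \(v(C_2^*) = f + F\theta_G\), \(\mu_2 = d + C_2^{-T} D(\mu_1 - \theta_G)\) and \(\theta_L = \mu_2 + C_2^{-T} s_2\). It assumes \(C_1\) and \(C_2\) are lower triangular, that the starred factors carry log-transformed diagonals, and that \(v(\cdot)\) stacks the lower-triangular elements column by column; dimensions and helper names are illustrative.

```python
# Assumption-laden sketch of one draw theta = (theta_G, theta_L) from q_lambda,
# with lambda = (mu1, v(C1*), d, D, f, F); s1, s2 ~ N(0, I).
import numpy as np
from scipy.linalg import solve_triangular

def vech_to_lower(v, m):
    """Unstack v(.) into an m x m lower-triangular matrix (column-major order),
    exponentiating the diagonal (assumption: starred factors have log diagonals)."""
    C = np.zeros((m, m))
    r, c = np.triu_indices(m)
    C[c, r] = v                                   # fill lower triangle in vech order
    C[np.diag_indices(m)] = np.exp(np.diag(C))
    return C

def draw_theta(mu1, vC1, d, D, f, F, G, nL, seed=None):
    rng = np.random.default_rng(seed)
    s1 = rng.standard_normal(G)
    s2 = rng.standard_normal(nL)
    C1 = vech_to_lower(vC1, G)
    # theta_G = mu_1 + C_1^{-T} s_1  (triangular solve instead of explicit inverse)
    theta_G = mu1 + solve_triangular(C1.T, s1, lower=False)
    # v(C_2^*) = f + F theta_G, then recover the lower-triangular factor C_2
    C2 = vech_to_lower(f + F @ theta_G, nL)
    # mu_2 = d + C_2^{-T} D (mu_1 - theta_G),  theta_L = mu_2 + C_2^{-T} s_2
    mu2 = d + solve_triangular(C2.T, D @ (mu1 - theta_G), lower=False)
    theta_L = mu2 + solve_triangular(C2.T, s2, lower=False)
    return theta_G, theta_L
```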
Appendix B: Gradients for generalized linear mixed models
Since \(\theta = [\beta ^T, \omega ^T, {\tilde{b}}_1^T, \dots , {\tilde{b}}_n^T]^T\), we require
For the centered parametrization, the components in \(\nabla _\theta \log p(y, \theta )\) are given below. Note that \(\beta = [\beta _{RG_1}^T, \beta _{G_2}^T]^T\).
Differentiating \(\log p(y, \theta ) \) with respect to \(\omega \),
where \(\mathrm{d}v(W) = D^*_L \mathrm{d}\omega \) and \(D^*_L = \mathrm{diag}\{ v(\mathrm{dg}(W) + \mathbf{1}_L\mathbf{1}_L^T - I_L) \}\). Hence
Note that \(D_L^* v(W^{-T}) = v(I_L)\) because \(W^{-T}\) is upper triangular and \(v(W^{-T})\) only retains the diagonal elements.
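The Jacobian \(D_L^*\) above is consistent with a parametrization in which the off-diagonal entries of \(\omega\) equal those of \(W\) and the diagonal entries are \(\log W_{ii}\). Below is a short sketch under that assumption (the ordering of \(v(\cdot)\) and the helper names are illustrative, not taken from the paper).

```python
# Sketch (under the assumptions stated above) of recovering W from omega and
# forming the diagonal of D_L^* = diag{ v( dg(W) + 1_L 1_L^T - I_L ) }, i.e. the
# Jacobian of v(W) w.r.t. omega: dW_ii/domega_ii = W_ii, dW_ij/domega_ij = 1 (i > j).
import numpy as np

def omega_to_W(omega, L):
    """Map omega = v(W^*) (column-major lower triangle, log diagonal) to W."""
    W = np.zeros((L, L))
    r, c = np.triu_indices(L)
    W[c, r] = omega                               # vech ordering of the lower triangle
    W[np.diag_indices(L)] = np.exp(np.diag(W))
    return W

def D_L_star_diag(W):
    """Diagonal of D_L^* as a vector: v( dg(W) + 1_L 1_L^T - I_L )."""
    L = W.shape[0]
    M = np.diag(np.diag(W)) + np.ones((L, L)) - np.eye(L)
    r, c = np.triu_indices(L)
    return M[c, r]                                # v(.) in the same vech ordering
```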
Appendix C: Gradients for state space models
Since \(\theta = [\alpha , \kappa , \psi , b_1^T, \dots , b_n^T]^T\), we require
The components in \(\nabla _\theta \log p(y, \theta )\) are given below.
For \(2 \le i \le n-1\),
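Since the analytic expressions in Appendices B and C are lengthy, it is useful to verify them numerically. The sketch below compares a supplied analytic gradient of \(\log p(y, \theta)\) with a central finite-difference approximation; the callables log_joint and grad_log_joint are hypothetical placeholders for the model-specific log joint density and its gradient.

```python
# Generic numerical check for analytic gradients of log p(y, theta), applicable
# to both the GLMM and state space expressions above.
import numpy as np

def check_gradient(log_joint, grad_log_joint, theta, eps=1e-5):
    """Return the maximum absolute discrepancy between the analytic gradient
    and a central finite-difference approximation at theta."""
    theta = np.asarray(theta, dtype=float)
    num = np.empty_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        num[j] = (log_joint(theta + e) - log_joint(theta - e)) / (2 * eps)
    return np.max(np.abs(num - grad_log_joint(theta)))
```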