Abstract
We develop flexible methods for deriving variational inference in models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models there are global variance parameters that determine the scale of the local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations that exploit conditional independence structure, as well as improved estimation of the log marginal likelihood and the variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. The methodology is illustrated with applications to generalized linear mixed models and state space models.
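For concreteness, the following is a minimal, non-authoritative sketch of the importance-weighted estimate of the log marginal likelihood mentioned above, \(\log p(y) \approx \log\{K^{-1}\sum_k p(y,\theta_k)/q_\lambda(\theta_k)\}\) with \(\theta_k\) drawn from the variational density. The callables log_joint, sample_q and log_q are hypothetical placeholders, not functions supplied by the paper.

```python
# Minimal sketch (not the authors' code) of the importance-weighted estimate of
# the log marginal likelihood: log p(y) ~ logsumexp(log w_k) - log K, where
# log w_k = log p(y, theta_k) - log q(theta_k) and theta_k ~ q_lambda.
import numpy as np
from scipy.special import logsumexp

def iw_log_marginal_estimate(log_joint, sample_q, log_q, K=100, seed=None):
    rng = np.random.default_rng(seed)
    log_w = np.empty(K)
    for k in range(K):
        theta = sample_q(rng)                        # draw theta_k ~ q_lambda
        log_w[k] = log_joint(theta) - log_q(theta)   # log importance weight
    # log-sum-exp for numerical stability; larger K tightens the bound
    return logsumexp(log_w) - np.log(K)
```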
Acknowledgements
We thank the editor and reviewer for their time and constructive comments on this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Linda Tan and Aishwarya Bhaskaran are supported by the start-up Grant R-155-000-190-133.
Appendices
Appendix A: Derivation of stochastic gradient
Let \(\otimes \) denote the Kronecker product of two matrices. We have
where \(v(C_2^*) = f + F(\mu _1 + C_1^{-T} s_1)\). Differentiating \(r_\lambda (s)\) with respect to \(\lambda \), \(\nabla _\lambda r_\lambda (s) \) is given by
Since \(\theta _G\) does not depend on d, D, f and F, we have
It is easy to see that \(\nabla _{\mu _1} \theta _G = I_G\) and \(\nabla _d \theta _L = I_{nL}\). The rest of the terms are derived as follows.
Differentiating \(\theta _G\) with respect to \(v(C_1^*)\),
Differentiating \(\theta _L\) with respect to f,
Differentiating \(\theta _L\) with respect to F,
Differentiating \(\theta _L\) with respect to D,
Differentiating \(\theta _L\) with respect to \(\mu _1\),
Differentiating \(\theta _L\) with respect to \(v(C_1)\),
Since \(s_1 = C_1^T(\theta _G - \mu _1)\) and \(s_2 = C_2^T (\theta _L - \mu _2)\), we have
As \(\mu _2 = d + C_2^{-T} D(\mu _1 - \theta _G)\) and \(v(C_2^*) = f + F \theta _G\), differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _G\),
Therefore
Note that \(D_2^* v(C_2^{-T}) = v(I_{nL})\) as \(C_2^{-T}\) is upper triangular and \(v(C_2^{-T})\) only retains the diagonal elements of \(C_2^{-T}\).
Differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _L\),
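As a reading aid, here is a minimal sketch (not the authors' code) of generating a draw \(\theta = (\theta_G, \theta_L)\) from the variational approximation using the relations above: \(\theta_G = \mu_1 + C_1^{-T} s_1\), \(v(C_2^*) = f + F\theta_G\), \(\mu_2 = d + C_2^{-T} D(\mu_1 - \theta_G)\) and \(\theta_L = \mu_2 + C_2^{-T} s_2\). It assumes \(C_1\) and \(C_2\) are lower triangular, that the starred factors carry log-transformed diagonals, and that \(v(\cdot)\) stacks the lower-triangular elements column by column; dimensions and helper names are illustrative.

```python
# Assumption-laden sketch of one draw theta = (theta_G, theta_L) from q_lambda,
# with lambda = (mu1, v(C1*), d, D, f, F); s1, s2 ~ N(0, I).
import numpy as np
from scipy.linalg import solve_triangular

def vech_to_lower(v, m):
    """Unstack v(.) into an m x m lower-triangular matrix (column-major order),
    exponentiating the diagonal (assumption: starred factors have log diagonals)."""
    C = np.zeros((m, m))
    r, c = np.triu_indices(m)
    C[c, r] = v                                   # fill lower triangle in vech order
    C[np.diag_indices(m)] = np.exp(np.diag(C))
    return C

def draw_theta(mu1, vC1, d, D, f, F, G, nL, seed=None):
    rng = np.random.default_rng(seed)
    s1 = rng.standard_normal(G)
    s2 = rng.standard_normal(nL)
    C1 = vech_to_lower(vC1, G)
    # theta_G = mu_1 + C_1^{-T} s_1  (triangular solve instead of explicit inverse)
    theta_G = mu1 + solve_triangular(C1.T, s1, lower=False)
    # v(C_2^*) = f + F theta_G, then recover the lower-triangular factor C_2
    C2 = vech_to_lower(f + F @ theta_G, nL)
    # mu_2 = d + C_2^{-T} D (mu_1 - theta_G),  theta_L = mu_2 + C_2^{-T} s_2
    mu2 = d + solve_triangular(C2.T, D @ (mu1 - theta_G), lower=False)
    theta_L = mu2 + solve_triangular(C2.T, s2, lower=False)
    return theta_G, theta_L
```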
Appendix B: Gradients for generalized linear mixed models
Since \(\theta = [\beta ^T, \omega ^T, {\tilde{b}}_1^T, \dots , {\tilde{b}}_n^T]^T\), we require
For the centered parametrization, the components in \(\nabla _\theta \log p(y, \theta )\) are given below. Note that \(\beta = [\beta _{RG_1}^T, \beta _{G_2}^T]^T\).
Differentiating \(\log p(y, \theta ) \) with respect to \(\omega \),
where \(\mathrm{d}v(W) = D^*_L \mathrm{d}\omega \) and \(D^*_L = \mathrm{diag}\{ v(\mathrm{dg}(W) + \mathbf{1}_L\mathbf{1}_L^T - I_L) \}\). Hence
Note that \(D_L^* v(W^{-T}) = v(I_L)\) because \(W^{-T}\) is upper triangular and \(v(W^{-T})\) only retains the diagonal elements.
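The Jacobian \(D_L^*\) above is consistent with a parametrization in which the off-diagonal entries of \(\omega\) equal those of \(W\) and the diagonal entries are \(\log W_{ii}\). Below is a short sketch under that assumption (the ordering of \(v(\cdot)\) and the helper names are illustrative, not taken from the paper).

```python
# Sketch (under the assumptions stated above) of recovering W from omega and
# forming the diagonal of D_L^* = diag{ v( dg(W) + 1_L 1_L^T - I_L ) }, i.e. the
# Jacobian of v(W) w.r.t. omega: dW_ii/domega_ii = W_ii, dW_ij/domega_ij = 1 (i > j).
import numpy as np

def omega_to_W(omega, L):
    """Map omega = v(W^*) (column-major lower triangle, log diagonal) to W."""
    W = np.zeros((L, L))
    r, c = np.triu_indices(L)
    W[c, r] = omega                               # vech ordering of the lower triangle
    W[np.diag_indices(L)] = np.exp(np.diag(W))
    return W

def D_L_star_diag(W):
    """Diagonal of D_L^* as a vector: v( dg(W) + 1_L 1_L^T - I_L )."""
    L = W.shape[0]
    M = np.diag(np.diag(W)) + np.ones((L, L)) - np.eye(L)
    r, c = np.triu_indices(L)
    return M[c, r]                                # v(.) in the same vech ordering
```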
Appendix C: Gradients for state space models
Since \(\theta = [\alpha , \kappa , \psi , b_1^T, \dots , b_n^T]^T\), we require
The components in \(\nabla _\theta \log p(y, \theta )\) are given below.
For \(2 \le i \le n-1\),
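Since the analytic expressions in Appendices B and C are lengthy, it is useful to verify them numerically. The sketch below compares a supplied analytic gradient of \(\log p(y, \theta)\) with a central finite-difference approximation; the callables log_joint and grad_log_joint are hypothetical placeholders for the model-specific log joint density and its gradient.

```python
# Generic numerical check for analytic gradients of log p(y, theta), applicable
# to both the GLMM and state space expressions above.
import numpy as np

def check_gradient(log_joint, grad_log_joint, theta, eps=1e-5):
    """Return the maximum absolute discrepancy between the analytic gradient
    and a central finite-difference approximation at theta."""
    theta = np.asarray(theta, dtype=float)
    num = np.empty_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        num[j] = (log_joint(theta + e) - log_joint(theta - e)) / (2 * eps)
    return np.max(np.abs(num - grad_log_joint(theta)))
```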