
Conditionally structured variational Gaussian approximation with importance weights


Abstract

We develop flexible methods of deriving variational inference for models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models, there are global variance parameters which determine the scale of local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations by using conditional independence structure and improved estimation of the log marginal likelihood and variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. Application of the methodology is illustrated using generalized linear mixed models and state space models.
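The importance weighting referred to in the abstract can be illustrated with the generic estimator of the log marginal likelihood: draw \(\theta_1,\dots,\theta_K\) from the variational density \(q_\lambda\) and average the weights \(w_k = p(y,\theta_k)/q_\lambda(\theta_k)\) on the log scale. The sketch below is purely illustrative and is not an example from the paper; it uses a toy conjugate model with a deliberately crude Gaussian \(q\) so that the exact answer is available for comparison.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative only): y | theta ~ N(theta, 1), theta ~ N(0, 1).
y = 1.3
exact_log_evidence = norm.logpdf(y, loc=0.0, scale=np.sqrt(2.0))

# A deliberately crude variational density q(theta) = N(m, s^2).
m, s = 0.3, 1.2

def iw_log_evidence(K):
    """Importance-weighted estimate of log p(y) using K draws from q."""
    theta = rng.normal(m, s, size=K)
    log_w = (norm.logpdf(y, theta, 1.0)      # log p(y | theta)
             + norm.logpdf(theta, 0.0, 1.0)  # log p(theta)
             - norm.logpdf(theta, m, s))     # log q(theta)
    return logsumexp(log_w) - np.log(K)

for K in (1, 10, 1000):
    print(K, iw_log_evidence(K), exact_log_evidence)
```

With \(K=1\) the estimate is, in expectation, the usual evidence lower bound; larger \(K\) gives, in expectation, progressively tighter lower bounds on the exact log marginal likelihood.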



Acknowledgements

We wish to thank the editor and reviewer for their time in reviewing this manuscript and for their constructive comments.

Author information


Corresponding author

Correspondence to Linda S. L. Tan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Linda Tan and Aishwarya Bhaskaran are supported by the start-up Grant R-155-000-190-133.

Appendices

Appendix A: Derivation of stochastic gradient

Let \(\otimes \) denote the Kronecker product between any two matrices. We have

$$\begin{aligned} r_\lambda (s) = \begin{bmatrix} \theta _G \\ \theta _L \end{bmatrix} = \begin{bmatrix} \mu _1 + C_1^{-T} s_1 \\ d+ C_2^{-T} (s_2 - DC_1^{-T} s_1) \end{bmatrix}, \end{aligned}$$

where \(v(C_2^*) = f + F(\mu _1 + C_1^{-T} s_1)\). Differentiating \(r_\lambda (s)\) with respect to \(\lambda \), \(\nabla _\lambda r_\lambda (s) \) is given by

$$\begin{aligned} \begin{bmatrix} \nabla _{\mu _1} \theta _G & \nabla _{\mu _1} \theta _L \\ \nabla _{v(C_1^*)} \theta _G & \nabla _{v(C_1^*)} \theta _L \\ \nabla _{d} \theta _G & \nabla _{d} \theta _L \\ \nabla _{\mathrm{vec}(D)} \theta _G & \nabla _{\mathrm{vec}(D)} \theta _L \\ \nabla _{f} \theta _G & \nabla _{f} \theta _L \\ \nabla _{\mathrm{vec}(F)} \theta _G & \nabla _{\mathrm{vec}(F)} \theta _L \end{bmatrix}. \end{aligned}$$

Since \(\theta _G\) does not depend on d, D, f and F, we have

$$\begin{aligned} \begin{aligned} \nabla _{d} \theta _G&= 0_{nL \times G}, \quad \nabla _{\mathrm{vec}(D)} \theta _G = 0_{nLG \times G} \\ \nabla _{f} \theta _G&= 0_{nL(nL+1)/2 \times G}, \;\; \nabla _{\mathrm{vec}(F)} \theta _G = 0_{nLG(nL+1)/2 \times G}. \end{aligned} \end{aligned}$$

It is easy to see that \(\nabla _{\mu _1} \theta _G = I_G\) and \(\nabla _d \theta _L = I_{nL}\). The rest of the terms are derived as follows.

Differentiating \(\theta _G\) with respect to \(v(C_1^*)\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _G&= -C_1^{-T} \mathrm{d}(C_1^T) C_1^{-T} s_1\\&= - (s_1^T C_1^{-1} \otimes C_1^{-T}) K_G E_G^T D_1^* \mathrm{d}v(C_1^*) \\&= - (C_1^{-T} \otimes s_1^T C_1^{-1}) E_G^T D_1^* \mathrm{d}v(C_1^*). \\ \therefore \; \nabla _{v(C_1^*)} \theta _G&= - D_1^* E_G (C_1^{-1} \otimes C_1^{-T} s_1 ). \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to f,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T} \mathrm{d}(C_2^T) C_2^{-T} (s_2 - DC_1^{-T} s_1) \\&= - \{ (s_2 - DC_1^{-T} s_1)^T C_2^{-1} \otimes C_2^{-T} \} \\&\quad \times K_{nL} E_{nL}^T D_2^* \mathrm{d}f \\ \therefore \; \nabla _{f} \theta _L&= - D_2^* E_{nL} \{ C_2^{-1} \otimes C_2^{-T} (s_2 - DC_1^{-T} s_1) \}. \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to F,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= (\nabla _{f} \theta _L)^T \mathrm{d}F \theta _G \\&= \{ \theta _G^T \otimes (\nabla _{f} \theta _L)^T \} \mathrm{d}\mathrm{vec}(F). \\ \therefore \; \nabla _{\mathrm{vec}(F)} \theta _L&= \theta _G \otimes \nabla _{f} \theta _L. \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to D,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T} \mathrm{d}D C_1^{-T} s_1 \\&= - (s_1^T C_1^{-1} \otimes C_2^{-T}) \mathrm{d}\mathrm{vec}(D). \\ \therefore \; \nabla _{\mathrm{vec}(D)} \theta _L&= - (C_1^{-T} s_1 \otimes C_2^{-1}). \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to \(\mu _1\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= (\nabla _{f} \theta _L)^T F \mathrm{d}\mu _1 \\ \therefore \; \nabla _{\mu _1} \theta _L&= F^T (\nabla _{f} \theta _L). \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to \(v(C_1^*)\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T}\mathrm{d}(C_2^T) C_2^{-T}(s_2 - DC_1^{-T} s_1) \\&\quad - C_2^{-T} D \mathrm{d}(C_1^{-T}) s_1 \\&= (\nabla _{f} \theta _L)^T F\mathrm{d}(C_1^{-T}) s_1 - C_2^{-T} D \mathrm{d}(C_1^{-T}) s_1 \\&= \{ (\nabla _{f} \theta _L)^T F - C_2^{-T} D\} (\nabla _{v(C_1^*)} \theta _G )^T \mathrm{d}v(C_1^*) \\ \therefore \; \nabla _{v(C_1^*)} \theta _L&= \nabla _{v(C_1^*)}\theta _G \{ F^T \nabla _{f} \theta _L - D^T C_2^{-1} \} \\&= \nabla _{v(C_1^*)}\theta _G \{ \nabla _{\mu _1} \theta _L - D^T C_2^{-1} \}. \end{aligned} \end{aligned}$$
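The chain-rule expressions above are easy to spot-check numerically. The sketch below is our own check rather than part of the derivation: it fixes arbitrary small dimensions and parameter values, treats \(\theta_L\) as a function of \(\mathrm{vec}(D)\) (column-major vec), and compares the stated \(\nabla_{\mathrm{vec}(D)} \theta_L = -(C_1^{-T}s_1 \otimes C_2^{-1})\) with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
G, nL = 2, 3   # arbitrary small dimensions for the check

# Arbitrary lower-triangular factors with positive diagonals, and remaining parameters.
C1 = np.tril(rng.normal(size=(G, G)), -1) + np.diag(np.exp(rng.normal(size=G)))
C2 = np.tril(rng.normal(size=(nL, nL)), -1) + np.diag(np.exp(rng.normal(size=nL)))
D = rng.normal(size=(nL, G))
d = rng.normal(size=nL)
s1, s2 = rng.normal(size=G), rng.normal(size=nL)

C1_invT = np.linalg.inv(C1).T
C2_invT = np.linalg.inv(C2).T

def theta_L(vecD):
    """theta_L = d + C2^{-T}(s2 - D C1^{-T} s1), as a function of vec(D) (column-major)."""
    Dm = vecD.reshape(nL, G, order="F")
    return d + C2_invT @ (s2 - Dm @ (C1_invT @ s1))

# Analytic gradient from the derivation above: -(C1^{-T} s1  kron  C2^{-1}).
grad_analytic = -np.kron((C1_invT @ s1).reshape(-1, 1), np.linalg.inv(C2))

# Central finite-difference Jacobian, transposed to match the gradient convention.
vecD0, eps = D.flatten(order="F"), 1e-6
J = np.column_stack([(theta_L(vecD0 + eps * e) - theta_L(vecD0 - eps * e)) / (2 * eps)
                     for e in np.eye(nL * G)])
print(np.max(np.abs(J.T - grad_analytic)))   # should be near zero (finite-difference error only)
```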

Since \(s_1 = C_1^T(\theta _G - \mu _1)\) and \(s_2 = C_2^T (\theta _L - \mu _2)\), we have

$$\begin{aligned} \begin{aligned} \log q_\lambda (\theta )&= \log q(\theta _G) + \log q(\theta _L|\theta _G) \\&= -\frac{G}{2} \log (2\pi ) + \log |C_1| \\&\quad - \frac{1}{2}(\theta _G - \mu _1)^T C_1 C_1^T (\theta _G - \mu _1) \\&\quad -\frac{nL}{2} \log (2\pi ) + \log |C_2| \\&\quad - \frac{1}{2}(\theta _L - \mu _2)^T C_2 C_2^T (\theta _L - \mu _2) \\&= - \frac{nL+G}{2} \log (2\pi ) + \log |C_1| + \log |C_2| - \frac{1}{2} s^T s. \end{aligned} \end{aligned}$$

As \(\mu _2 = d + C_2^{-T} D(\mu _1 - \theta _G)\) and \(v(C_2^*) = f + F \theta _G\), differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _G\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log q_\lambda (\theta )&= - (\theta _G - \mu _1)^T C_1 C_1^T \mathrm{d}\theta _G - (\theta _L - \mu _2)^T C_2 C_2^T(-\mathrm{d}\mu _2)\\&\quad - (\theta _L - \mu _2)^T \mathrm{d}C_2 s_2 + \mathrm{tr}(C_2^{-1} \mathrm{d}C_2) \\&= -s_1^T C_1^T \mathrm{d}\theta _G + s_2^T C_2^T\{- C_2^{-T} D \mathrm{d}\theta _G \\&\quad + \mathrm{d}(C_2^{-T}) D(\mu _1 - \theta _G)\} \\&\quad - \mathrm{vec}(C_2^{-T} s_2 s_2^T)^T \mathrm{d}\mathrm{vec}(C_2) + \mathrm{vec}(C_2^{-T})^T \mathrm{d}\mathrm{vec}(C_2) \\&= \mathrm{vec}(C_2^{-T} - \{C_2^{-T} s_2 + (\mu _2 -d)\}s_2^T)^T \mathrm{d}\mathrm{vec}(C_2) \\&\quad -s_1^T C_1^T \mathrm{d}\theta _G - s_2^T D \mathrm{d}\theta _G \\&= \mathrm{vec}(C_2^{-T} - (\theta _L -d) s_2^T)^T E_{nL}^T D_2^* F \mathrm{d}\theta _G \\&\quad -s_1^T C_1^T \mathrm{d}\theta _G - s_2^T D \mathrm{d}\theta _G. \end{aligned} \end{aligned}$$

Therefore

$$\begin{aligned} \begin{aligned} \nabla _{\theta _G} \log q_\lambda (\theta )&=F^T D_2^* v(C_2^{-T} - (\theta _L -d) s_2^T) \\&\quad - C_1 s_1 - D^T s_2. \end{aligned} \end{aligned}$$

Note that \(D_2^* v(C_2^{-T}) = v(I_{nL})\) as \(C_2^{-T}\) is upper triangular and \(v(C_2^{-T})\) only retains the diagonal elements of \(C_2^{-T}\).

Differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _L\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log q_\lambda (\theta )&= - (\theta _L - \mu _2)^T C_2 C_2^T \mathrm{d}\theta _L \\&= - s_2^T C_2^T \mathrm{d}\theta _L. \\ \therefore \; \nabla _{\theta _L} \log q_\lambda (\theta )&= - C_2 s_2. \end{aligned} \end{aligned}$$
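As a further check (again not part of the original derivation), \(\nabla_{\theta_G} \log q_\lambda(\theta)\) can be compared with finite differences of \(\log q_\lambda(\theta)\) evaluated directly from its definition. The sketch below assumes the conventions used above: \(C_2\) is lower triangular, \(v(\cdot)\) stacks the lower-triangular elements in one fixed order, \(v(C_2^*) = f + F\theta_G\) with the diagonal of \(C_2\) stored on the log scale, and \(\mu_2 = d + C_2^{-T}D(\mu_1 - \theta_G)\); all dimensions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
G, nL = 2, 3
p = nL * (nL + 1) // 2                      # length of v(C2*)

# Arbitrary variational parameters for the check.
C1 = np.tril(rng.normal(size=(G, G)), -1) + np.diag(np.exp(rng.normal(size=G)))
D = rng.normal(size=(nL, G))
mu1, d = rng.normal(size=G), rng.normal(size=nL)
f, F = 0.1 * rng.normal(size=p), 0.1 * rng.normal(size=(p, G))

rows, cols = np.tril_indices(nL)            # one fixed ordering for v(.)
is_diag = rows == cols

def C2_of(tG):
    """Build C2 from v(C2*) = f + F theta_G, exponentiating the diagonal entries."""
    v = f + F @ tG
    C2 = np.zeros((nL, nL))
    C2[rows, cols] = v
    C2[np.diag_indices(nL)] = np.exp(v[is_diag])
    return C2

def log_q(tG, tL):
    """log q_lambda(theta) assembled from its definition."""
    C2 = C2_of(tG)
    s1 = C1.T @ (tG - mu1)
    mu2 = d + np.linalg.solve(C2.T, D @ (mu1 - tG))
    s2 = C2.T @ (tL - mu2)
    return (-(G + nL) / 2 * np.log(2 * np.pi) + np.log(np.diag(C1)).sum()
            + np.log(np.diag(C2)).sum() - 0.5 * (s1 @ s1 + s2 @ s2))

tG, tL = rng.normal(size=G), rng.normal(size=nL)

# Analytic gradient: F^T D2* v(C2^{-T} - (theta_L - d) s2^T) - C1 s1 - D^T s2.
C2 = C2_of(tG)
s1 = C1.T @ (tG - mu1)
s2 = C2.T @ (tL - (d + np.linalg.solve(C2.T, D @ (mu1 - tG))))
D2_star = np.where(is_diag, C2[rows, cols], 1.0)
M = np.linalg.inv(C2).T - np.outer(tL - d, s2)
grad = F.T @ (D2_star * M[rows, cols]) - C1 @ s1 - D.T @ s2

eps = 1e-6
fd = np.array([(log_q(tG + eps * e, tL) - log_q(tG - eps * e, tL)) / (2 * eps)
               for e in np.eye(G)])
print(np.max(np.abs(fd - grad)))            # should be near zero (finite-difference error only)
```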

Appendix B: Gradients for generalized linear mixed models

Since \(\theta = [\beta ^T, \omega ^T, {\tilde{b}}_1^T, \dots , {\tilde{b}}_n^T]^T\), we require

$$\begin{aligned}&\nabla _\theta \log p(y, \theta ) = [\nabla _\beta \log p(y, \theta ), \nabla _\omega \log p(y, \theta ), \\&\quad \nabla _{{\tilde{b}}_1} \log p(y, \theta ), \dots , \nabla _{{\tilde{b}}_n} \log p(y, \theta )]^T. \end{aligned}$$

For the centered parametrization, the components in \(\nabla _\theta \log p(y, \theta )\) are given below. Note that \(\beta = [\beta _{RG_1}^T, \beta _{G_2}^T]^T\).

$$\begin{aligned} \nabla _{\beta _{G_2}} \log p(y, \theta )&= \sum _{i=1}^n {X_i^{G_2}}^T \{ y_i - h'(\eta _i) \} - \beta _{G_2}/\sigma _\beta ^2, \\ \nabla _{\beta _{RG_1}} \log p(y, \theta )&= \sum _{i=1}^n C_i^T W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}) - \beta _{RG_1}/\sigma _\beta ^2. \end{aligned}$$

Differentiating \(\log p(y, \theta ) \) with respect to \(\omega \),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log p(y, \theta )&= -\sum _{i=1}^n ({\tilde{b}}_i - C_i \beta _{RG_1})^T \mathrm{d}W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}) \\&\quad + n \mathrm{tr}(W^{-1} \mathrm{d}W) - \omega ^T \mathrm{d}\omega /\sigma _\omega ^2 \\&= \mathrm{vec}\bigg \{ - \sum _{i=1}^n ({\tilde{b}}_i - C_i \beta _{RG_1}) ({\tilde{b}}_i - C_i \beta _{RG_1})^T W \\&\quad + nW^{-T} \bigg \}^T E_L^T D_L^* \mathrm{d}\omega - \omega ^T \mathrm{d}\omega /\sigma _\omega ^2, \end{aligned} \end{aligned}$$

where \(\mathrm{d}v(W) = D^*_L \mathrm{d}\omega \) and \(D^*_L = \mathrm{diag}\{ v(\mathrm{dg}(W) + \mathbf{1}_L\mathbf{1}_L^T - I_L) \}\). Hence

$$\begin{aligned} \nabla _\omega \log p(y, \theta ) = - D^*_L \sum _{i=1}^n v\{ ({\tilde{b}}_i - C_i \beta _{RG_1}) ({\tilde{b}}_i - C_i \beta _{RG_1})^T W \} + n\, v(I_L) - \omega /\sigma _\omega ^2. \end{aligned}$$

Note that \(D_L^* v(W^{-T}) = v(I_L)\) because \(W^{-T}\) is upper triangular and \(v(W^{-T})\) only retains the diagonal elements.

$$\begin{aligned} \nabla _{{\tilde{b}}_i} \log p(y, \theta ) = Z_i^T \{ y_i - h'(\eta _i)\} - W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}). \end{aligned}$$
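Since \(W\) enters \(\log p(y,\theta)\) only through the random-effects density and the prior on \(\omega\), the expression for \(\nabla_\omega \log p(y,\theta)\) above can be checked in isolation against finite differences of \(\sum_i \log N({\tilde{b}}_i \mid C_i\beta_{RG_1}, (WW^T)^{-1}) + \log N(\omega \mid 0, \sigma_\omega^2 I)\). The sketch below is our own check under the conventions above (lower-triangular \(W\) with log-scale diagonal, \(v(W^*) = \omega\)); the dimensions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, r = 4, 2, 3                           # subjects, random-effect dim, dim of beta_RG1
p = L * (L + 1) // 2
sigma_omega = 1.5

rows, cols = np.tril_indices(L)             # fixed ordering for v(.)
is_diag = rows == cols

omega = 0.3 * rng.normal(size=p)
beta = rng.normal(size=r)                   # stands in for beta_RG1
C = rng.normal(size=(n, L, r))              # the C_i matrices
b = rng.normal(size=(n, L))                 # the b-tilde_i

def W_of(om):
    """Lower-triangular W from v(W*) = omega, with exponentiated diagonal."""
    W = np.zeros((L, L))
    W[rows, cols] = om
    W[np.diag_indices(L)] = np.exp(om[is_diag])
    return W

def log_p_omega_part(om):
    """Terms of log p(y, theta) that depend on omega (additive constants omitted)."""
    W = W_of(om)
    out = -0.5 * om @ om / sigma_omega**2
    for i in range(n):
        resid = b[i] - C[i] @ beta
        out += np.log(np.diag(W)).sum() - 0.5 * resid @ (W @ W.T) @ resid
    return out

# Analytic gradient from Appendix B.
W = W_of(omega)
DL_star = np.where(is_diag, W[rows, cols], 1.0)
S = sum(np.outer(b[i] - C[i] @ beta, b[i] - C[i] @ beta) for i in range(n)) @ W
vI = np.where(is_diag, 1.0, 0.0)            # v(I_L)
grad = -DL_star * S[rows, cols] + n * vI - omega / sigma_omega**2

eps = 1e-6
fd = np.array([(log_p_omega_part(omega + eps * e) - log_p_omega_part(omega - eps * e)) / (2 * eps)
               for e in np.eye(p)])
print(np.max(np.abs(fd - grad)))            # should be near zero (finite-difference error only)
```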

Appendix C: Gradients for state space models

Since \(\theta = [\alpha , \kappa , \psi , b_1^T, \dots , b_n^T]^T\), we require

$$\begin{aligned}&\nabla _\theta \log p(y, \theta ) = [\nabla _\alpha \log p(y, \theta ), \nabla _\kappa \log p(y, \theta ), \\&\quad \nabla _\psi \log p(y, \theta ), \nabla _{b_1} \log p(y, \theta ), \dots , \nabla _{b_n} \log p(y, \theta )]^T. \end{aligned}$$

The components in \(\nabla _\theta \log p(y, \theta )\) are given below.

$$\begin{aligned} \nabla _\alpha \log p(y, \theta )&= \frac{1}{2}\sum ^n_{i=1} (b_i y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - b_i)(1-\mathrm{e}^{-\sigma }) - \frac{\alpha }{\sigma _{\alpha }^2}, \\ \nabla _\kappa \log p(y, \theta )&= \frac{1}{2} \bigg (\sum ^n_{i=1} y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - n \bigg ) - \kappa /\sigma _{\kappa }^2,\\ \nabla _\psi \log p(y, \theta )&= \bigg \{ \sum _{i=2}^{n} (b_i - \phi b_{i-1})b_{i-1} + b_1^2 \phi - \frac{\phi }{1-\phi ^2} \bigg \} \phi (1-\phi ) - \psi /\sigma _{\psi }^2,\\ \nabla _{b_1} \log p(y, \theta )&= \frac{\sigma }{2} (y_1^2 \mathrm{e}^{-\sigma b_1 - \kappa } - 1) + \phi (b_2 - \phi b_1) - b_1 (1-\phi ^2). \end{aligned}$$

For \(2 \le i \le n-1\),

$$\begin{aligned} \begin{aligned} \nabla _{b_i} \log p(y, \theta )&= \frac{\sigma }{2} (y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - 1) +\phi (b_{i+1} - \phi b_i)\\&\quad - (b_i - \phi b_{i-1}). \\ \nabla _{b_n} \log p(y, \theta )&= \frac{\sigma }{2} (y_n^2 \mathrm{e}^{-\sigma b_n - \kappa } - 1) - (b_n - \phi b_{n-1}). \end{aligned} \end{aligned}$$
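These \(b\)-gradients can likewise be checked against the joint log density that generates them. The sketch below is our own check and assumes the model structure implied by the formulas above: \(y_i \mid b_i \sim N(0, \mathrm{e}^{\sigma b_i + \kappa})\), \(b_i \mid b_{i-1} \sim N(\phi b_{i-1}, 1)\) and the stationary initialization \(b_1 \sim N(0, 1/(1-\phi^2))\); \(\sigma\), \(\kappa\) and \(\phi\) are held fixed at arbitrary values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
sigma, kappa, phi = 0.8, -0.5, 0.9          # fixed at arbitrary values for the check
y = rng.normal(size=n)
b = rng.normal(size=n)

def log_p_b_part(bvec):
    """Terms of log p(y, theta) that depend on b (additive constants omitted)."""
    out = -0.5 * np.sum(sigma * bvec + kappa + y**2 * np.exp(-sigma * bvec - kappa))
    out += -0.5 * bvec[0]**2 * (1 - phi**2)                  # stationary prior on b_1
    out += -0.5 * np.sum((bvec[1:] - phi * bvec[:-1])**2)    # AR(1) transitions
    return out

# Analytic gradients from Appendix C.
grad = 0.5 * sigma * (y**2 * np.exp(-sigma * b - kappa) - 1.0)
grad[0] += phi * (b[1] - phi * b[0]) - b[0] * (1 - phi**2)
grad[1:-1] += phi * (b[2:] - phi * b[1:-1]) - (b[1:-1] - phi * b[:-2])
grad[-1] += -(b[-1] - phi * b[-2])

eps = 1e-6
fd = np.array([(log_p_b_part(b + eps * e) - log_p_b_part(b - eps * e)) / (2 * eps)
               for e in np.eye(n)])
print(np.max(np.abs(fd - grad)))            # should be near zero (finite-difference error only)
```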


Cite this article

Tan, L.S.L., Bhaskaran, A. & Nott, D.J. Conditionally structured variational Gaussian approximation with importance weights. Stat Comput 30, 1255–1272 (2020). https://doi.org/10.1007/s11222-020-09944-8
