
A case study of the widely applicable Bayesian information criterion and its optimality

Published in Statistics and Computing.

Abstract

In Bayesian statistics, the marginal likelihood (evidence) is one of the key factors that can be used as a measure of model goodness. However, for many practical model families it cannot be computed analytically. An alternative solution is to use some approximation method or time-consuming sampling method. The widely applicable Bayesian information criterion (WBIC) was developed recently to have a marginal likelihood approximation that works also with singular models. The central idea of the approximation is to select a single thermodynamic integration term (power posterior) with the (approximated) optimal temperature \(\beta ^*=1/\log (n)\), where \(n\) is the data size. We apply this new approximation to the analytically solvable Gaussian process regression case to show that the optimal temperature may depend also on data itself or other variables, such as the noise level. Moreover, we show that the steepness of a thermodynamic curve at the optimal temperature indicates the magnitude of the error that WBIC makes.


(Figures 1–8 appear in the full article.)

References

  • Calderhead, B., Girolami, M.: Estimating Bayes factors via thermodynamic integration and population MCMC. Comput. Stat. Data Anal. 53, 4028–4045 (2009)

  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47, 547–553 (2009)

  • Filippone, M.: Bayesian inference for Gaussian process classifiers with annealing and exact-approximate MCMC. arXiv:1311.7320 (2013)

  • Filippone, M., Zhong, M., Girolami, M.: A comparative evaluation of stochastic-based inference methods for Gaussian process models. Mach. Learn. 93, 93–114 (2012)

  • Friel, N., Pettitt, A.N.: Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Series B (Stat. Methodol.) 70, 589–607 (2008)

  • Friel, N., Hurn, M., Wyse, J.: Improving power posterior estimation of statistical evidence. Stat. Comput. (2013). doi:10.1007/s11222-013-9397-1

  • Kuss, M., Rasmussen, C.E.: Assessing approximate inference for binary Gaussian process classification. J. Mach. Learn. Res. 6, 1679–1704 (2005)

  • Murray, I., Adams, R.P.: Slice sampling covariance hyperparameters of latent Gaussian models. Adv. Neural Inf. Process. Syst. 23, 1723–1731 (2010)

  • Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)

  • Petersen, K.B., Pedersen, M.S.: The Matrix Cookbook. Version: November 14 (2008)

  • Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)

  • Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)

  • Roos, T., Zou, Y.: Keep it simple stupid—On the effect of lower-order terms in BIC-like criteria. Information Theory and Applications Workshop, February 2013, San Diego, USA (2013)

  • Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Series B (Stat. Methodol.) 71, 319–392 (2009)

  • Vanhatalo, J., et al.: GPstuff: Bayesian modeling with Gaussian processes. J. Mach. Learn. Res. 14, 1175–1179 (2013)

  • Watanabe, S.: Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge (2009)

  • Watanabe, S.: A widely applicable Bayesian information criterion. J. Mach. Learn. Res. 14, 867–897 (2013)

Acknowledgments

The author wishes to thank Daniel Simpson, Arno Solin, Aki Vehtari and the anonymous reviewers for providing valuable comments on this article.

Author information

Correspondence to Tommi Mononen.

Appendix

Proof of Theorem 1:

Computing the denominator of WBIC for GP

In the following, we simplify notation by setting \(\sigma ^2_\mathrm{n}=\sigma ^2\), since the variable \(\sigma ^2_\mathrm{s}\) is contained in the matrix \({{\mathrm{\mathbf {K}}}}\) and therefore does not appear explicitly in the proof. Let us start by computing the \(\beta \)-weighted marginal likelihood:

$$\begin{aligned}&\int \limits _{-\infty }^{\infty }p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}})^{\beta }p({{\mathrm{\mathbf {f}}}}|D){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = \int \limits _{-\infty }^{\infty }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}})^{\beta }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {f}}}};0,{{\mathrm{\mathbf {K}}}}) {{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = \int \limits _{-\infty }^{\infty }\frac{1}{\sqrt{(2\pi )^{\beta n}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^\beta }}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})}\\&\qquad \times \frac{1}{\sqrt{(2\pi )^n|{{\mathrm{\mathbf {K}}}}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\\&\qquad \times \int \limits _{-\infty }^{\infty }{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}. \end{aligned}$$

Next, we combine the sum of the two squared forms into a single square (Petersen and Pedersen 2008, 8.1.7 Sum of two squared forms):

$$\begin{aligned}&-\frac{1}{2}({{\mathrm{\mathbf {f}}}}-{{\mathrm{\mathbf {y}}}})^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}({{\mathrm{\mathbf {f}}}}-{{\mathrm{\mathbf {y}}}})-\frac{1}{2}({{\mathrm{\mathbf {f}}}}-0)^\intercal {{\mathrm{\mathbf {K}}}}^{-1}({{\mathrm{\mathbf {f}}}}-0)\\&\quad =-\frac{1}{2}\left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \\&\qquad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\qquad \times \left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) \\&\qquad + \frac{1}{2}\left( {{\mathrm{\mathbf {y}}}}^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}\right) \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\qquad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) -\frac{1}{2}\left( {{\mathrm{\mathbf {y}}}}^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) \end{aligned}$$

Applying the matrix inversion lemma (Rasmussen and Williams 2006, A.3 Matrix Identities) to the second term, we get

$$\begin{aligned}&-\frac{1}{2}\left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \\&\quad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\quad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\end{aligned}$$
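Both steps above, the completion of squares and the matrix-inversion-lemma reduction of the constant term to \(-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal ((\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}})^{-1}{{\mathrm{\mathbf {y}}}}\), can be verified numerically. The following sketch (not from the paper; test values are arbitrary) checks the identity with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma2 = 4, 0.25, 0.5

# Random SPD covariance K and test vectors y, f
M = rng.standard_normal((n, n))
K = M @ M.T + n * np.eye(n)
y = rng.standard_normal(n)
f = rng.standard_normal(n)

D = (sigma2 / beta) * np.eye(n)          # D = (sigma^2/beta) I
Di, Ki = np.linalg.inv(D), np.linalg.inv(K)

# Left-hand side: the sum of the two squared forms
lhs = -0.5 * (f - y) @ Di @ (f - y) - 0.5 * f @ Ki @ f

# Right-hand side: completed square plus the constant term, with the
# constant already reduced via the matrix inversion lemma
P = Di + Ki                              # precision of the completed square
m = np.linalg.solve(P, Di @ y)           # its center
rhs = -0.5 * (f - m) @ P @ (f - m) - 0.5 * y @ np.linalg.solve(D + K, y)

assert np.isclose(lhs, rhs)
```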

Now we substitute the previous results into the integral and then apply the normalization rule for a Gaussian integral (Petersen and Pedersen 2008, 8.1.1 Density and normalization):

$$\begin{aligned}&(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \int \limits _{-\infty }^{\infty }\exp \Biggl (-\frac{1}{2}\biggl ({{\mathrm{\mathbf {f}}}}-\biggl (\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\biggr )^{-1}\\&\qquad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\qquad \times \biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )\Biggr ){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times (2\pi )^\frac{n}{2}|(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}|^{-\frac{1}{2}}\\&\quad = (2\pi )^{-\frac{\beta n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \left( |\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf 
{I}}}}\right) +{{\mathrm{\mathbf {K}}}}|~|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|~|{{\mathrm{\mathbf {K}}}}^{-1}|\right) ^{-\frac{1}{2}}\\&\quad = (2\pi )^{-\frac{\beta n}{2}}|(\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}|^{\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}^{-1}|^{\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \left( |\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}|~|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|~|{{\mathrm{\mathbf {K}}}}^{-1}|\right) ^{-\frac{1}{2}}\\&\quad = \frac{1}{\sqrt{(2\pi )^{\beta n}|{{\mathrm{\mathbf {K}}}}+(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})|}}\sqrt{\frac{|(\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}|^\beta }{|(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( {{\mathrm{\mathbf {K}}}}+(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad =\mathrm{DENOM} \end{aligned}$$

Notice that by setting \(\beta = 1\) and taking the logarithm, we recover the well-known log marginal likelihood formula.
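This reduction can also be checked numerically: the sketch below (not part of the original article; `log_denom` is a hypothetical helper) evaluates the closed-form DENOM and confirms that at \(\beta = 1\) it coincides with the standard GP marginal likelihood \({{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};0,{{\mathrm{\mathbf {K}}}}+\sigma ^2{{\mathrm{\mathbf {I}}}})\), assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_denom(y, K, sigma2, beta):
    """Log of DENOM: the beta-weighted marginal likelihood in closed form."""
    n = len(y)
    S = K + (sigma2 / beta) * np.eye(n)
    # log of the determinant ratio |(sigma^2 I)^{-1}|^beta / |((sigma^2/beta) I)^{-1}|
    log_ratio = -beta * n * np.log(sigma2) + n * np.log(sigma2 / beta)
    _, logdet_S = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    return -0.5 * (beta * n * np.log(2 * np.pi) + logdet_S - log_ratio + quad)

rng = np.random.default_rng(1)
n, sigma2 = 5, 0.3
M = rng.standard_normal((n, n))
K = M @ M.T + np.eye(n)
y = rng.standard_normal(n)

# At beta = 1 the determinant ratio vanishes and DENOM reduces to
# the GP marginal likelihood N(y; 0, K + sigma^2 I)
ref = multivariate_normal(mean=np.zeros(n), cov=K + sigma2 * np.eye(n)).logpdf(y)
assert np.isclose(log_denom(y, K, sigma2, beta=1.0), ref)
```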

Computing the numerator of WBIC for GP

$$\begin{aligned}&\int \limits _{-\infty }^{\infty }\log p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}}) p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}})^{\beta }p({{\mathrm{\mathbf {f}}}}|D){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\int \limits _{-\infty }^{\infty }\log {{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}}){{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}})^{\beta }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {f}}}};0,{{\mathrm{\mathbf {K}}}}) {{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\int \limits _{-\infty }^{\infty }\biggl (-\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\\&\quad -\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})\biggr )\\&\quad \times \frac{1}{\sqrt{(2\pi )^{\beta n}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^\beta }}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})}\\&\quad \times \frac{1}{\sqrt{(2\pi )^n|{{\mathrm{\mathbf {K}}}}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}\\&\quad \times |{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\int \limits _{-\infty }^{\infty }{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf 
{f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad -\frac{1}{2}\int \limits _{-\infty }^{\infty }({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}\\&\quad \times |{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times \int \limits _{-\infty }^{\infty }({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})\\&\quad \times \exp \Biggl (-\frac{1}{2}\biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\quad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\quad \times \biggl ({{\mathrm{\mathbf {f}}}}-\left( 
\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )\Biggr ){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\end{aligned}$$

Using Matrix Cookbook formula 357 (Petersen and Pedersen 2008, 8.2.2 Mean and variance of square forms) and compensating for the missing constants, we obtain

$$\begin{aligned}&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times (2\pi )^\frac{n}{2}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}|^{-\frac{1}{2}}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \\&\quad +\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{\beta n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|^{-\frac{1}{2}}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\\&\quad \times {{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) 
^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \\&\quad +\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\ =&~\mathrm{DENOM}\times \Biggl (-\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\\&\quad -\frac{1}{2}\Biggl [\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\&\quad + \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \Biggr ]\Biggr )\\ =&~\mathrm{DENOM}\times (-\mathbf{WBIC}) \end{aligned}$$

In the last line we observe that, since the numerator is divided by the denominator and the numerator contains the denominator as a factor, the remaining factor of the numerator must be (negative) WBIC itself. After further manipulation (applying the matrix inversion lemma to all three appropriate terms) we obtain a numerically stable version:

$$\begin{aligned}&-\mathbf{WBIC}=-\frac{n}{2}\log \left( 2\pi \sigma ^2\right) \\&\quad -\frac{1}{2\beta }\Biggl [\mathrm{tr}\left( {{\mathrm{\mathbf {I}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \right) \\&\quad +\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \!+\!{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\Biggr ], \end{aligned}$$

which can be simplified further by moving constant factors outside the trace and interpreting the latter term as a squared norm. As a result, we obtain the compact form

$$\begin{aligned} \mathbf{WBIC}=&\frac{n}{2}\left( \log (2\pi \sigma ^2)+\frac{1}{\beta }\right) \\&\quad +\frac{\sigma ^2}{2\beta ^2}\biggl (||\mathbf {A{{\mathrm{\mathbf {y}}}}}||^2-\mathrm{tr}(\mathbf {A})\biggr )\text {, where}\\&\quad \mathbf {A}=\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1} \end{aligned}$$

\(\square \)
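As a sanity check of the derivation (not part of the original proof), the compact form can be compared against a direct evaluation of WBIC as the negative expected log-likelihood under the Gaussian power posterior, whose precision \((\beta /\sigma ^2){{\mathrm{\mathbf {I}}}}+{{\mathrm{\mathbf {K}}}}^{-1}\) and mean follow from the completed square above. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 6, 0.4
beta = 1.0 / np.log(n)                   # Watanabe's approximate optimal temperature
M = rng.standard_normal((n, n))
K = M @ M.T + np.eye(n)
y = rng.standard_normal(n)

# Compact form: A = ((sigma^2/beta) I + K)^{-1}
A = np.linalg.inv((sigma2 / beta) * np.eye(n) + K)
wbic_neat = (n / 2) * (np.log(2 * np.pi * sigma2) + 1 / beta) \
    + (sigma2 / (2 * beta**2)) * ((A @ y) @ (A @ y) - np.trace(A))

# Direct route: WBIC = -E[log p(y|f)] under the Gaussian power posterior
# with precision P = (beta/sigma^2) I + K^{-1} and mean m = P^{-1}(beta/sigma^2) y
P = (beta / sigma2) * np.eye(n) + np.linalg.inv(K)
Sigma = np.linalg.inv(P)
m = Sigma @ ((beta / sigma2) * y)
wbic_direct = (n / 2) * np.log(2 * np.pi * sigma2) \
    + ((y - m) @ (y - m) + np.trace(Sigma)) / (2 * sigma2)

assert np.isclose(wbic_neat, wbic_direct)
```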

The exact solution to the pruned GP case

The following exact solution was computed with Mathematica by equating the results of Lemmas 1 and 2.

Lemma 3

The optimal temperature for the pruned GP regression case with all the terms included is

$$\begin{aligned} \hat{\beta }^\mathrm{opt}=&-\Biggr (-2{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}\sigma ^2_\mathrm{n}+\sigma ^2_\mathrm{s}nW+2n\sigma ^2_\mathrm{n}W\log \frac{\sigma ^2_\mathrm{n}}{W}+\sqrt{W}\\&\quad \times \sqrt{4({{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}})^2\sigma ^2_\mathrm{n}\!+\!\sigma ^4_\mathrm{s}n^2W\!+\!4{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}n\sigma ^2_\mathrm{n}W\log \frac{W}{\sigma ^2_\mathrm{n}}}\Biggr )\\&\quad \bigg /\left( 2\sigma ^2_\mathrm{s}\left( -{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}+nW\log \frac{\sigma ^2_\mathrm{n}}{W}\right) \right) , \end{aligned}$$

where \(W=(\sigma ^2_\mathrm{s}+\sigma ^2_\mathrm{n})\).
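For illustration (input values below are hypothetical, not from the paper), Lemma 3 can be evaluated numerically; note that, unlike Watanabe's \(\beta ^*=1/\log (n)\), the result depends on the data through \({{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}\) and on the noise level:

```python
import numpy as np

def beta_opt(yTy, n, s2_s, s2_n):
    """Optimal temperature of Lemma 3 for the pruned GP regression case.

    yTy is y^T y; s2_s and s2_n are the signal and noise variances.
    """
    W = s2_s + s2_n
    inner = (-2 * yTy * s2_n + s2_s * n * W
             + 2 * n * s2_n * W * np.log(s2_n / W)
             + np.sqrt(W) * np.sqrt(4 * yTy**2 * s2_n
                                    + s2_s**2 * n**2 * W     # sigma_s^4 = (s2_s)^2
                                    + 4 * yTy * n * s2_n * W * np.log(W / s2_n)))
    denom = 2 * s2_s * (-yTy + n * W * np.log(s2_n / W))
    return -inner / denom

# Hypothetical test values, with y^T y set to its prior expectation n*W:
# the optimal temperature is finite and lies strictly between 0 and 1
n, s2_s, s2_n = 50, 1.0, 0.5
b = beta_opt(n * (s2_s + s2_n), n, s2_s, s2_n)
assert 0 < b < 1
```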

About this article

Cite this article

Mononen, T. A case study of the widely applicable Bayesian information criterion and its optimality. Stat Comput 25, 929–940 (2015). https://doi.org/10.1007/s11222-014-9463-3
