
A case study of the widely applicable Bayesian information criterion and its optimality

Published in Statistics and Computing.

Abstract

In Bayesian statistics, the marginal likelihood (evidence) is one of the key factors that can be used as a measure of model goodness. However, for many practical model families it cannot be computed analytically. An alternative solution is to use some approximation method or time-consuming sampling method. The widely applicable Bayesian information criterion (WBIC) was developed recently to have a marginal likelihood approximation that works also with singular models. The central idea of the approximation is to select a single thermodynamic integration term (power posterior) with the (approximated) optimal temperature \(\beta ^*=1/\log (n)\), where \(n\) is the data size. We apply this new approximation to the analytically solvable Gaussian process regression case to show that the optimal temperature may depend also on data itself or other variables, such as the noise level. Moreover, we show that the steepness of a thermodynamic curve at the optimal temperature indicates the magnitude of the error that WBIC makes.


(Figures 1–8 appear in the full article.)

References

  • Calderhead, B., Girolami, M.: Estimating Bayes factors via thermodynamic integration and population MCMC. Comput. Stat. Data Anal. 53, 4028–4045 (2009)

  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47, 547–553 (2009)

  • Filippone, M.: Bayesian inference for Gaussian process classifiers with annealing and exact-approximate MCMC. arXiv:1311.7320 (2013)

  • Filippone, M., Zhong, M., Girolami, M.: A comparative evaluation of stochastic-based inference methods for Gaussian process models. Mach. Learn. 93, 93–114 (2012)

  • Friel, N., Pettitt, A.N.: Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Series B (Stat. Methodol.) 70, 589–607 (2008)

  • Friel, N., Hurn, M., Wyse, J.: Improving power posterior estimation of statistical evidence. Stat. Comput. (2013). doi:10.1007/s11222-013-9397-1

  • Kuss, M., Rasmussen, C.E.: Assessing approximate inference for binary Gaussian process classification. J. Mach. Learn. Res. 6, 1679–1704 (2005)

  • Murray, I., Adams, R.P.: Slice sampling covariance hyperparameters of latent Gaussian models. Adv. Neural Inf. Process. Syst. 23, 1723–1731 (2010)

  • Nickisch, H., Rasmussen, C.E.: Approximations for binary Gaussian process classification. J. Mach. Learn. Res. 9, 2035–2078 (2008)

  • Petersen, K.B., Pedersen, M.S.: The Matrix Cookbook. Version: November 14 (2008)

  • Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)

  • Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)

  • Roos, T., Zou, Y.: Keep it simple stupid—On the effect of lower-order terms in BIC-like criteria. Information Theory and Applications Workshop, February 2013, San Diego, USA (2013)

  • Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Series B (Stat. Methodol.) 71, 319–392 (2009)

  • Vanhatalo, J., et al.: GPstuff: Bayesian modeling with Gaussian processes. J. Mach. Learn. Res. 14, 1175–1179 (2013)

  • Watanabe, S.: Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge (2009)

  • Watanabe, S.: A widely applicable Bayesian information criterion. J. Mach. Learn. Res. 14, 867–897 (2013)

Acknowledgments

The author wishes to thank Daniel Simpson, Arno Solin, Aki Vehtari and the anonymous reviewers for providing valuable comments on this article.

Author information

Correspondence to Tommi Mononen.

Appendix

Proof of Theorem 1:

Computing the denominator of WBIC for GP

In the following, we simplify notation by setting \(\sigma ^2_\mathrm{n}=\sigma ^2\), since the variable \(\sigma ^2_\mathrm{s}\) is contained in the matrix \({{\mathrm{\mathbf {K}}}}\) and therefore does not appear explicitly in the proof. Let us start by computing the \(\beta \)-weighted marginal likelihood:

$$\begin{aligned}&\int \limits _{-\infty }^{\infty }p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}})^{\beta }p({{\mathrm{\mathbf {f}}}}|D){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = \int \limits _{-\infty }^{\infty }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}})^{\beta }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {f}}}};0,{{\mathrm{\mathbf {K}}}}) {{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = \int \limits _{-\infty }^{\infty }\frac{1}{\sqrt{(2\pi )^{\beta n}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^\beta }}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})}\\&\qquad \times \frac{1}{\sqrt{(2\pi )^n|{{\mathrm{\mathbf {K}}}}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\\&\qquad \times \int \limits _{-\infty }^{\infty }{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}. \end{aligned}$$

Next, we combine the sum of the two squared forms into a single square (Petersen and Pedersen 2008, 8.1.7 Sum of two squared forms):

$$\begin{aligned}&-\frac{1}{2}({{\mathrm{\mathbf {f}}}}-{{\mathrm{\mathbf {y}}}})^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}({{\mathrm{\mathbf {f}}}}-{{\mathrm{\mathbf {y}}}})-\frac{1}{2}({{\mathrm{\mathbf {f}}}}-0)^\intercal {{\mathrm{\mathbf {K}}}}^{-1}({{\mathrm{\mathbf {f}}}}-0)\\&\quad =-\frac{1}{2}\left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \\&\qquad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\qquad \times \left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) \\&\qquad + \frac{1}{2}\left( {{\mathrm{\mathbf {y}}}}^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}\right) \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\qquad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) -\frac{1}{2}\left( {{\mathrm{\mathbf {y}}}}^\intercal \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) \end{aligned}$$

Applying the matrix inversion lemma (Rasmussen and Williams 2006, A.3 Matrix Identities) to the second term, we get

$$\begin{aligned}&-\frac{1}{2}\left( {{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \\&\quad \times \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\quad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\end{aligned}$$
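Both steps above, the completion of squares and the matrix-inversion-lemma reduction of the constant term to \(-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal ((\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}})^{-1}{{\mathrm{\mathbf {y}}}}\), can be verified numerically. The following sketch (not from the paper; test values are arbitrary) checks the identity with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma2 = 4, 0.25, 0.5

# Random SPD covariance K and test vectors y, f
M = rng.standard_normal((n, n))
K = M @ M.T + n * np.eye(n)
y = rng.standard_normal(n)
f = rng.standard_normal(n)

D = (sigma2 / beta) * np.eye(n)          # D = (sigma^2/beta) I
Di, Ki = np.linalg.inv(D), np.linalg.inv(K)

# Left-hand side: the sum of the two squared forms
lhs = -0.5 * (f - y) @ Di @ (f - y) - 0.5 * f @ Ki @ f

# Right-hand side: completed square plus the constant term, with the
# constant already reduced via the matrix inversion lemma
P = Di + Ki                              # precision of the completed square
m = np.linalg.solve(P, Di @ y)           # its center
rhs = -0.5 * (f - m) @ P @ (f - m) - 0.5 * y @ np.linalg.solve(D + K, y)

assert np.isclose(lhs, rhs)
```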

Now we substitute the previous results into the integral and then apply the normalization rule for a Gaussian integral (Petersen and Pedersen 2008, 8.1.1 Density and normalization):

$$\begin{aligned}&(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \int \limits _{-\infty }^{\infty }\exp \Biggl (-\frac{1}{2}\biggl ({{\mathrm{\mathbf {f}}}}-\biggl (\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\biggr )^{-1}\\&\qquad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\qquad \times \biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )\Biggr ){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad = (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times (2\pi )^\frac{n}{2}|(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}|^{-\frac{1}{2}}\\&\quad = (2\pi )^{-\frac{\beta n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \left( |\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf 
{I}}}}\right) +{{\mathrm{\mathbf {K}}}}|~|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|~|{{\mathrm{\mathbf {K}}}}^{-1}|\right) ^{-\frac{1}{2}}\\&\quad = (2\pi )^{-\frac{\beta n}{2}}|(\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}|^{\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}^{-1}|^{\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\qquad \times \left( |\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}|~|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|~|{{\mathrm{\mathbf {K}}}}^{-1}|\right) ^{-\frac{1}{2}}\\&\quad = \frac{1}{\sqrt{(2\pi )^{\beta n}|{{\mathrm{\mathbf {K}}}}+(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})|}}\sqrt{\frac{|(\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}|^\beta }{|(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( {{\mathrm{\mathbf {K}}}}+(\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad =\mathrm{DENOM} \end{aligned}$$

Notice that by setting \(\beta = 1\) and taking the logarithm, we recover the well-known log marginal likelihood formula.
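This reduction can also be checked numerically: the sketch below (not part of the original article; `log_denom` is a hypothetical helper) evaluates the closed-form DENOM and confirms that at \(\beta = 1\) it coincides with the standard GP marginal likelihood \({{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};0,{{\mathrm{\mathbf {K}}}}+\sigma ^2{{\mathrm{\mathbf {I}}}})\), assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_denom(y, K, sigma2, beta):
    """Log of DENOM: the beta-weighted marginal likelihood in closed form."""
    n = len(y)
    S = K + (sigma2 / beta) * np.eye(n)
    # log of the determinant ratio |(sigma^2 I)^{-1}|^beta / |((sigma^2/beta) I)^{-1}|
    log_ratio = -beta * n * np.log(sigma2) + n * np.log(sigma2 / beta)
    _, logdet_S = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    return -0.5 * (beta * n * np.log(2 * np.pi) + logdet_S - log_ratio + quad)

rng = np.random.default_rng(1)
n, sigma2 = 5, 0.3
M = rng.standard_normal((n, n))
K = M @ M.T + np.eye(n)
y = rng.standard_normal(n)

# At beta = 1 the determinant ratio vanishes and DENOM reduces to
# the GP marginal likelihood N(y; 0, K + sigma^2 I)
ref = multivariate_normal(mean=np.zeros(n), cov=K + sigma2 * np.eye(n)).logpdf(y)
assert np.isclose(log_denom(y, K, sigma2, beta=1.0), ref)
```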

Computing the numerator of WBIC for GP

$$\begin{aligned}&\int \limits _{-\infty }^{\infty }\log p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}}) p({{\mathrm{\mathbf {y}}}}|{{\mathrm{\mathbf {f}}}},{{\mathrm{D}}})^{\beta }p({{\mathrm{\mathbf {f}}}}|D){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\int \limits _{-\infty }^{\infty }\log {{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}}){{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {y}}}};{{\mathrm{\mathbf {f}}}},\sigma ^2{{\mathrm{\mathbf {I}}}})^{\beta }{{\mathrm{\mathcal {N}}}}({{\mathrm{\mathbf {f}}}};0,{{\mathrm{\mathbf {K}}}}) {{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\int \limits _{-\infty }^{\infty }\biggl (-\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\\&\quad -\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})\biggr )\\&\quad \times \frac{1}{\sqrt{(2\pi )^{\beta n}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^\beta }}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})}\\&\quad \times \frac{1}{\sqrt{(2\pi )^n|{{\mathrm{\mathbf {K}}}}|}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) (2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}\\&\quad \times |{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\int \limits _{-\infty }^{\infty }{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf 
{f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\&\quad -\frac{1}{2}\int \limits _{-\infty }^{\infty }({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}\\&\quad \times |{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})-\frac{1}{2}{{\mathrm{\mathbf {f}}}}^\intercal {{\mathrm{\mathbf {K}}}}^{-1}{{\mathrm{\mathbf {f}}}}}{{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times \int \limits _{-\infty }^{\infty }({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}({{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {f}}}})\\&\quad \times \exp \Biggl (-\frac{1}{2}\biggl ({{\mathrm{\mathbf {f}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\\&\quad \times \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) \\&\quad \times \biggl ({{\mathrm{\mathbf {f}}}}-\left( 
\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\biggr )\Biggr ){{\mathrm{d}}}{{\mathrm{\mathbf {f}}}}\end{aligned}$$

Using Matrix Cookbook formula 357 (Petersen and Pedersen 2008, 8.2.2 Mean and variance of square forms) and compensating for the missing constants, we obtain

$$\begin{aligned}&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{(\beta +1) n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}{{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times (2\pi )^\frac{n}{2}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}|^{-\frac{1}{2}}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \\&\quad +\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\ =&\left( -\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\right) \times \mathrm{DENOM}\\&\quad -\frac{1}{2}(2\pi )^{-\frac{\beta n}{2}}|\sigma ^2{{\mathrm{\mathbf {I}}}}|^{-\frac{\beta }{2}}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}|^{-\frac{1}{2}}|\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}|^{-\frac{1}{2}}\\&\quad \times {{\mathrm{\mathbf {e}}}}^{-\frac{1}{2}{{\mathrm{\mathbf {y}}}}^\intercal \left( (\frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}})+{{\mathrm{\mathbf {K}}}}\right) 
^{-1}{{\mathrm{\mathbf {y}}}}}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \\&\quad +\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\ =&~\mathrm{DENOM}\times \Biggl (-\frac{n}{2}\log (2\pi )-\frac{1}{2}\log |\sigma ^2{{\mathrm{\mathbf {I}}}}|\\&\quad -\frac{1}{2}\Biggl [\mathrm{tr}\left( (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\right) \\&\quad + \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) ^\intercal (\sigma ^2{{\mathrm{\mathbf {I}}}})^{-1}\\&\quad \times \left( \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}+{{\mathrm{\mathbf {K}}}}^{-1}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}-{{\mathrm{\mathbf {y}}}}\right) \Biggr ]\Biggr )\\ =&~\mathrm{DENOM}\times (-\mathbf{WBIC}) \end{aligned}$$

In the last line we observe that, since the numerator is divided by the denominator and the numerator contains the denominator as a factor, the remaining factor of the numerator must be (negative) WBIC itself. After further manipulation (applying the matrix inversion lemma to all three appropriate terms) we obtain a numerically stable version:

$$\begin{aligned}&-\mathbf{WBIC}=-\frac{n}{2}\log \left( 2\pi \sigma ^2\right) \\&\quad -\frac{1}{2\beta }\Biggl [\mathrm{tr}\left( {{\mathrm{\mathbf {I}}}}-\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}\left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \right) \\&\quad +\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\right) ^\intercal \left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) \!+\!{{\mathrm{\mathbf {K}}}}\right) ^{-1}{{\mathrm{\mathbf {y}}}}\Biggr ], \end{aligned}$$

which can be simplified further by moving constant factors outside the trace and interpreting the latter term as a squared norm. As a result, we obtain the compact form

$$\begin{aligned} \mathbf{WBIC}=&\frac{n}{2}\left( \log (2\pi \sigma ^2)+\frac{1}{\beta }\right) \\&\quad +\frac{\sigma ^2}{2\beta ^2}\biggl (||\mathbf {A{{\mathrm{\mathbf {y}}}}}||^2-\mathrm{tr}(\mathbf {A})\biggr )\text {, where}\\&\quad \mathbf {A}=\left( \left( \frac{\sigma ^2}{\beta }{{\mathrm{\mathbf {I}}}}\right) +{{\mathrm{\mathbf {K}}}}\right) ^{-1} \end{aligned}$$

\(\square \)
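As a sanity check of the derivation (not part of the original proof), the compact form can be compared against a direct evaluation of WBIC as the negative expected log-likelihood under the Gaussian power posterior, whose precision \((\beta /\sigma ^2){{\mathrm{\mathbf {I}}}}+{{\mathrm{\mathbf {K}}}}^{-1}\) and mean follow from the completed square above. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 6, 0.4
beta = 1.0 / np.log(n)                   # Watanabe's approximate optimal temperature
M = rng.standard_normal((n, n))
K = M @ M.T + np.eye(n)
y = rng.standard_normal(n)

# Compact form: A = ((sigma^2/beta) I + K)^{-1}
A = np.linalg.inv((sigma2 / beta) * np.eye(n) + K)
wbic_neat = (n / 2) * (np.log(2 * np.pi * sigma2) + 1 / beta) \
    + (sigma2 / (2 * beta**2)) * ((A @ y) @ (A @ y) - np.trace(A))

# Direct route: WBIC = -E[log p(y|f)] under the Gaussian power posterior
# with precision P = (beta/sigma^2) I + K^{-1} and mean m = P^{-1}(beta/sigma^2) y
P = (beta / sigma2) * np.eye(n) + np.linalg.inv(K)
Sigma = np.linalg.inv(P)
m = Sigma @ ((beta / sigma2) * y)
wbic_direct = (n / 2) * np.log(2 * np.pi * sigma2) \
    + ((y - m) @ (y - m) + np.trace(Sigma)) / (2 * sigma2)

assert np.isclose(wbic_neat, wbic_direct)
```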

The exact solution to the pruned GP case

The following exact solution was computed with Mathematica by equating the results of Lemmas 1 and 2.

Lemma 3

The optimal temperature for the pruned GP regression case with all the terms included is

$$\begin{aligned} \hat{\beta }^\mathrm{opt}=&-\Biggr (-2{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}\sigma ^2_\mathrm{n}+\sigma ^2_\mathrm{s}nW+2n\sigma ^2_\mathrm{n}W\log \frac{\sigma ^2_\mathrm{n}}{W}+\sqrt{W}\\&\quad \times \sqrt{4({{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}})^2\sigma ^2_\mathrm{n}\!+\!\sigma ^4_\mathrm{s}n^2W\!+\!4{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}n\sigma ^2_\mathrm{n}W\log \frac{W}{\sigma ^2_\mathrm{n}}}\Biggr )\\&\quad \bigg /\left( 2\sigma ^2_\mathrm{s}\left( -{{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}+nW\log \frac{\sigma ^2_\mathrm{n}}{W}\right) \right) , \end{aligned}$$

where \(W=(\sigma ^2_\mathrm{s}+\sigma ^2_\mathrm{n})\).
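For illustration (input values below are hypothetical, not from the paper), Lemma 3 can be evaluated numerically; note that, unlike Watanabe's \(\beta ^*=1/\log (n)\), the result depends on the data through \({{\mathrm{\mathbf {y}}}}^\intercal {{\mathrm{\mathbf {y}}}}\) and on the noise level:

```python
import numpy as np

def beta_opt(yTy, n, s2_s, s2_n):
    """Optimal temperature of Lemma 3 for the pruned GP regression case.

    yTy is y^T y; s2_s and s2_n are the signal and noise variances.
    """
    W = s2_s + s2_n
    inner = (-2 * yTy * s2_n + s2_s * n * W
             + 2 * n * s2_n * W * np.log(s2_n / W)
             + np.sqrt(W) * np.sqrt(4 * yTy**2 * s2_n
                                    + s2_s**2 * n**2 * W     # sigma_s^4 = (s2_s)^2
                                    + 4 * yTy * n * s2_n * W * np.log(W / s2_n)))
    denom = 2 * s2_s * (-yTy + n * W * np.log(s2_n / W))
    return -inner / denom

# Hypothetical test values, with y^T y set to its prior expectation n*W:
# the optimal temperature is finite and lies strictly between 0 and 1
n, s2_s, s2_n = 50, 1.0, 0.5
b = beta_opt(n * (s2_s + s2_n), n, s2_s, s2_n)
assert 0 < b < 1
```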

About this article

Cite this article

Mononen, T. A case study of the widely applicable Bayesian information criterion and its optimality. Stat Comput 25, 929–940 (2015). https://doi.org/10.1007/s11222-014-9463-3
