Abstract
We study distributed and robust Gaussian processes, where robustness is introduced by a Gaussian process prior on the function values combined with a Student-t likelihood. The posterior distribution is approximated by a Laplace approximation and, together with concepts from Bayesian committee machines, we efficiently distribute the computations and render robust GPs feasible on huge data sets. We provide a detailed derivation and report on empirical results. Our findings on real and artificial data show that our approach outperforms existing baselines in the presence of outliers while using all available data.
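To make the aggregation step concrete, here is a minimal sketch of the standard Bayesian committee machine rule for combining independent GP experts at a single test point; the function name and interface are illustrative rather than taken from the paper, and the robust variant discussed here would feed in each expert's Laplace-approximate predictive moments.

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """Combine M Gaussian expert predictions at one test point with the
    Bayesian committee machine rule (illustrative sketch).

    means, variances: shape-(M,) arrays of expert predictive means and
    variances; prior_var: the GP prior variance k(x*, x*) at the point.
    """
    M = len(means)
    # Sum of expert precisions, corrected for the (M - 1)-fold
    # over-counting of the prior precision.
    precision = np.sum(1.0 / variances) - (M - 1) / prior_var
    var = 1.0 / precision
    # Precision-weighted combination of the expert means.
    mean = var * np.sum(means / variances)
    return mean, var
```

Since each expert only touches its own shard of the data, the M expert predictions can be computed fully in parallel and combined in O(M) time per test point.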
Appendix
Before providing the derivation of the partial derivatives of the approximate log marginal likelihood in Eq. (21), we introduce the matrix \(\varvec{R}\), which will be convenient later on,
\[ \varvec{R} = \left( \varvec{W}^{-1} + \varvec{K} \right)^{-1}. \]
Using the matrix \(\varvec{R}\) as well as the matrix inversion lemma allows us to reformulate the inverse of \(\varvec{K}^{-1} + \varvec{W}\) as a sum of the kernel matrix \(\varvec{K}\) and a new matrix \(\varvec{J}\),
\[ \left( \varvec{K}^{-1} + \varvec{W} \right)^{-1} = \varvec{K} - \varvec{K} \varvec{R} \varvec{K} = \varvec{K} + \varvec{J} \quad \text{with} \quad \varvec{J} = -\varvec{K} \varvec{R} \varvec{K}. \]
Recall that there are kernel as well as likelihood hyperparameters. We focus on a squared exponential kernel with automatic relevance determination, parametrized by the signal standard deviation \(\sigma _f\) and the length scales \(\ell _i\) for all \(i=1,2,\ldots ,d\) dimensions. The likelihood is parametrized by the scale \(\sigma _t\) and the degrees of freedom \(\nu \).
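For concreteness, these two components take the following standard form; this is a sketch, and the paper's exact parametrization (e.g., whether \(\sigma _f\) or \(\sigma _f^2\) denotes the signal variance) may differ:
\[ k(\varvec{x}, \varvec{x}') = \sigma _f^2 \exp \left( -\frac{1}{2} \sum _{i=1}^{d} \frac{(x_i - x_i')^2}{\ell _i^2} \right), \qquad p(y \,|\, f) = \frac{\Gamma \left( \frac{\nu +1}{2} \right)}{\Gamma \left( \frac{\nu }{2} \right) \sqrt{\nu \pi }\, \sigma _t} \left( 1 + \frac{(y - f)^2}{\nu \sigma _t^2} \right)^{-\frac{\nu +1}{2}}. \]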
Partial derivatives with respect to the kernel hyperparameters
The partial derivatives with respect to the kernel hyperparameters are given by
\[ \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} = \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} \right| _{\text {explicit}} + \sum _{i=1}^{n} \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \hat{f}_i} \frac{\partial \hat{f}_i}{\partial \theta _j}, \]
which consists of an explicit and an implicit term. The implicit term is caused by the dependence of \(\hat{\varvec{f}}\) and \(\varvec{W}\) on \(\varvec{K}\), which in turn depends on the kernel hyperparameters. The first part of the explicit term, \(\partial \log p(\varvec{y} \,|\, \hat{\varvec{f}}) / \partial \theta _j\), is equal to zero, since the likelihood does not directly depend on the kernel hyperparameters. For the second term, we use the intermediate result \(\varvec{a}\) from Eq. (14) to obtain
\[ \frac{\partial }{\partial \theta _j} \left( -\frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \right) = \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{K}^{-1} \hat{\varvec{f}} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a}, \]
and for the third term we get
\[ \frac{\partial }{\partial \theta _j} \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right) \]
by using the definitions of the matrices \(\varvec{B}\) and \(\varvec{R}\) and the fact that a cyclic permutation of a matrix product leaves its trace unchanged. Therefore, the explicit part of the partial derivative is given by
\[ \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} \right| _{\text {explicit}} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a} - \frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right). \]
Now we take care of the implicit part of the partial derivative. Differentiating the first two terms with respect to \(\hat{\varvec{f}}\) is equivalent to differentiating \(\varPsi (\varvec{f})\), whose gradient vanishes at the mode \(\hat{\varvec{f}}\),
\[ \frac{\partial }{\partial \hat{\varvec{f}}} \left( \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \right) = \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \varvec{K}^{-1} \hat{\varvec{f}} = \varvec{0}. \]
The third term of the partial derivative is the derivative of the log determinant of \(\varvec{B}\). Using the definitions of the matrices \(\varvec{B}\) and \(\varvec{J}\) yields
\[ \frac{\partial }{\partial \hat{f}_i} \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \left[ \varvec{K} + \varvec{J} \right] _{ii} \frac{\partial \varvec{W}_{ii}}{\partial \hat{f}_i} = \frac{1}{2} \left[ \varvec{K} + \varvec{J} \right] _{ii} \frac{\partial ^3}{\partial \hat{f}_i^3} \log p(y_i \,|\, \hat{f}_i). \]
We still need to take care of \(\frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }\). By using the definition of \(\hat{\varvec{f}}\) from Eq. (11) as well as the multidimensional chain rule, we obtain
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _j} = \frac{\partial \varvec{K}}{\partial \theta _j} \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \varvec{K} \varvec{W} \frac{\partial \hat{\varvec{f}}}{\partial \theta _j}, \]
which is equivalent to
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _j} = \left( \varvec{I} + \varvec{K} \varvec{W} \right)^{-1} \frac{\partial \varvec{K}}{\partial \theta _j} \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}). \]
Finally, the partial derivative of the approximate log marginal likelihood with respect to the kernel hyperparameters is given by
\[ \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a} - \frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right) + \sum _{i=1}^{n} \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \hat{f}_i} \frac{\partial \hat{f}_i}{\partial \theta _j}. \]
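Putting the pieces together, the following is a minimal NumPy sketch of how the explicit and implicit parts combine for one kernel hyperparameter; all names are illustrative, and we assume \(\varvec{W}\) is invertible so that \(\varvec{R} = (\varvec{W}^{-1} + \varvec{K})^{-1}\) is well defined.

```python
import numpy as np

def kernel_grad(K, dK, f_hat, grad_ll, W, d3_ll):
    """One kernel-hyperparameter derivative of the approximate log
    marginal likelihood (illustrative sketch).

    K: kernel matrix; dK: its derivative w.r.t. theta_j;
    f_hat: posterior mode; grad_ll: d log p(y|f) / df at f_hat;
    W: -d^2 log p(y|f) / df^2 at f_hat (diagonal, as a vector);
    d3_ll: d^3 log p(y|f) / df^3 at f_hat.
    """
    n = len(f_hat)
    a = np.linalg.solve(K, f_hat)              # a = K^{-1} f_hat
    R = np.linalg.inv(np.diag(1.0 / W) + K)    # R = (W^{-1} + K)^{-1}
    KJ = K - K @ R @ K                         # K + J = (K^{-1} + W)^{-1}
    # Explicit part: 0.5 a' (dK) a - 0.5 tr(R dK).
    explicit = 0.5 * a @ dK @ a - 0.5 * np.trace(R @ dK)
    # Implicit part: d log q / d f_hat_i = 0.5 [K + J]_ii * d3_ll_i,
    # multiplied by d f_hat / d theta_j = (I + K W)^{-1} dK grad_ll.
    dq_df = 0.5 * np.diag(KJ) * d3_ll
    df_dtheta = np.linalg.solve(np.eye(n) + K @ np.diag(W), dK @ grad_ll)
    return explicit + dq_df @ df_dtheta
```

In practice one would factorize the relevant matrices once per hyperparameter sweep instead of calling `inv` and `solve` repeatedly; the sketch favors readability over efficiency.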
Partial derivatives with respect to the likelihood hyperparameters
We now consider the partial derivatives with respect to the likelihood hyperparameters \(\theta _\ell \in \{\sigma _t, \nu \}\). As in Eq. (28), the derivative splits up into an explicit and an implicit term. For the explicit part, we utilize the factorization of the likelihood \(p( \varvec{y} | \hat{\varvec{f}} )\) to obtain
\[ \frac{\partial \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell } = \sum _{i=1}^{n} \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \theta _\ell }. \]
The second term in the explicit part is equal to zero, since neither \(\hat{\varvec{f}}\) nor \(\varvec{K}\) directly depends on a likelihood hyperparameter.
For the third term, we have
\[ \frac{\partial }{\partial \theta _\ell } \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \operatorname{tr} \left( \left( \varvec{K} + \varvec{J} \right) \frac{\partial \varvec{W}}{\partial \theta _\ell } \right), \]
which yields the final expression for the explicit part, given by
\[ \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _\ell } \right| _{\text {explicit}} = \sum _{i=1}^{n} \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \theta _\ell } - \frac{1}{2} \operatorname{tr} \left( \left( \varvec{K} + \varvec{J} \right) \frac{\partial \varvec{W}}{\partial \theta _\ell } \right). \]
The implicit part closely mirrors the implicit part for the kernel hyperparameters, the only modification being that the dependence on \(\theta _\ell \) enters through the likelihood rather than through the kernel matrix. Differentiating the definition of \(\hat{\varvec{f}}\) with respect to \(\theta _\ell \) gives
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell } = \varvec{K} \frac{\partial \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell } - \varvec{K} \varvec{W} \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell }. \]
This is equivalent to
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell } = \left( \varvec{I} + \varvec{K} \varvec{W} \right)^{-1} \varvec{K} \frac{\partial \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell }. \]
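For reference, writing \(r_i = y_i - \hat{f}_i\), the Student-t quantities entering these expressions take the following standard form (a sketch of the usual derivatives; the paper's parametrization may differ in constants):
\[ \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \hat{f}_i} = \frac{(\nu + 1)\, r_i}{\nu \sigma _t^2 + r_i^2}, \qquad \varvec{W}_{ii} = -\frac{\partial ^2 \log p(y_i \,|\, \hat{f}_i)}{\partial \hat{f}_i^2} = \frac{(\nu + 1)\left( \nu \sigma _t^2 - r_i^2 \right)}{\left( \nu \sigma _t^2 + r_i^2 \right)^2}. \]
Note that \(\varvec{W}_{ii}\) becomes negative for large residuals \(|r_i| > \sqrt{\nu }\, \sigma _t\), which is why formulations that avoid \(\varvec{W}^{1/2}\), such as the matrices \(\varvec{R}\) and \(\varvec{J}\) above, are convenient for the Student-t likelihood.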