
Distributed robust Gaussian Process regression


Abstract

We study distributed and robust Gaussian Process regression, where robustness is introduced by combining a Gaussian Process prior on the function values with a Student-t likelihood. The posterior distribution is approximated by a Laplace approximation; together with concepts from Bayesian Committee Machines, this allows us to distribute the computations efficiently and renders robust GPs on huge data sets feasible. We provide a detailed derivation and report on empirical results. Our findings on real and artificial data show that, by using all available data, our approach outperforms existing baselines in the presence of outliers.




Author information

Corresponding author

Correspondence to Sebastian Mair.

Appendix

Before providing the derivation of the partial derivatives of the approximate log marginal likelihood in Eq. (21), we introduce the matrix \(\varvec{R}\), which will be convenient later on.

$$\begin{aligned} \varvec{R}&= ( \varvec{W}^{-1} + \varvec{K} )^{-1} \overset{ (25) }{ = } \varvec{W}^{\frac{1}{2}} ( \underbrace{ \varvec{I} + \varvec{W}^{\frac{1}{2}} \varvec{K} \varvec{W}^{\frac{1}{2}} }_{ = \varvec{B} = \varvec{L} \varvec{L}^\top } )^{-1} \varvec{W}^{\frac{1}{2}} = \varvec{W}^{\frac{1}{2}} \varvec{B}^{-1} \varvec{W}^{\frac{1}{2}}. \end{aligned}$$
(26)

Using the matrix \(\varvec{R}\) as well as the matrix inversion lemma allows us to rewrite the inverse of \(\varvec{K}^{-1} + \varvec{W}\) in terms of the kernel matrix \(\varvec{K}\) and a new matrix \(\varvec{J}\),

$$\begin{aligned} \Big ( \varvec{K}^{-1} + \varvec{W} \Big )^{-1}&= \varvec{K} - \varvec{K} \underbrace{ \Big ( \varvec{K} + \varvec{W}^{-1} \Big )^{-1} }_{ = \varvec{R} } \varvec{K} \overset{ (26) }{ = } \varvec{K} - \varvec{K} \varvec{R} \varvec{K} \overset{ (26) }{ = } \varvec{K} - \varvec{K} \varvec{W}^{\frac{1}{2}} \varvec{B}^{-1} \varvec{W}^{\frac{1}{2}} \varvec{K} \nonumber \\&= \varvec{K} - \underbrace{ \varvec{K} \varvec{W}^{\frac{1}{2}} ( \varvec{L}^{-1} )^\top }_{ = \varvec{J}^\top } \underbrace{ \varvec{L}^{-1} \varvec{W}^{\frac{1}{2}} \varvec{K} }_{ := \varvec{J} } \overset{ }{ = } \varvec{K} - \varvec{J}^\top \varvec{J}. \end{aligned}$$
(27)
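
The identities in Eqs. (26) and (27) are easy to verify numerically. The following sketch is purely illustrative: it assumes a toy positive-definite kernel matrix \(\varvec{K}\) and a diagonal \(\varvec{W}\) with positive entries so that \(\varvec{W}^{\frac{1}{2}}\) is real (with a Student-t likelihood, \(\varvec{W}\) may contain negative entries at the mode, which requires extra care); all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)             # toy positive-definite "kernel" matrix
W = np.diag(rng.uniform(0.5, 2.0, n))   # toy diagonal W with positive entries
W_half = np.sqrt(W)                     # W^{1/2} (element-wise sqrt of the diagonal)

B = np.eye(n) + W_half @ K @ W_half     # B = I + W^{1/2} K W^{1/2}
L = np.linalg.cholesky(B)               # B = L L^T

# Eq. (26): R = (W^{-1} + K)^{-1} = W^{1/2} B^{-1} W^{1/2}
R = W_half @ np.linalg.solve(B, W_half)
assert np.allclose(R, np.linalg.inv(np.linalg.inv(W) + K))

# Eq. (27): (K^{-1} + W)^{-1} = K - J^T J  with  J = L^{-1} W^{1/2} K
J = np.linalg.solve(L, W_half @ K)
assert np.allclose(np.linalg.inv(np.linalg.inv(K) + W), K - J.T @ J)
```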

Recall that there are kernel hyperparameters as well as likelihood hyperparameters. We focus on a squared exponential kernel with automatic relevance determination, parametrized by the signal standard deviation \(\sigma _f\) and the length scales \(\ell _i\) for all \(i=1,2,\ldots ,d\) dimensions. The likelihood is parametrized by the scale \(\sigma _t\) and the degrees of freedom \(\nu \).
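
For concreteness, a minimal sketch of this parametrization is given below. The function names and array conventions are ours, \(\sigma _f\) denotes the signal standard deviation (so \(\sigma _f^2\) is the signal variance), and the Student-t density is the standard one with scale \(\sigma _t\) and \(\nu \) degrees of freedom.

```python
import numpy as np
from scipy.special import gammaln

def se_ard_kernel(X1, X2, sigma_f, lengthscales):
    """Squared exponential kernel with ARD:
    k(x, x') = sigma_f^2 * exp(-0.5 * sum_i ((x_i - x'_i) / l_i)^2)."""
    D1 = X1 / lengthscales                  # scale each dimension by its length scale
    D2 = X2 / lengthscales
    sq = (np.sum(D1 ** 2, axis=1)[:, None]
          + np.sum(D2 ** 2, axis=1)[None, :]
          - 2.0 * D1 @ D2.T)                # squared scaled distances
    return sigma_f ** 2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def student_t_log_lik(y, f, sigma_t, nu):
    """Element-wise log density ln p(y_i | f_i) of a Student-t likelihood."""
    r2 = ((y - f) / sigma_t) ** 2
    return (gammaln((nu + 1.0) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * np.log(nu * np.pi * sigma_t ** 2)
            - (nu + 1.0) / 2.0 * np.log1p(r2 / nu))
```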

Partial derivatives with respect to the kernel hyperparameters

The partial derivatives with respect to the kernel hyperparameters are given by

$$\begin{aligned} \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \theta _j }&\overset{ (21) }{ = } \frac{ \partial }{ \partial \theta _j } \Bigg ( \ln p( \varvec{y} | \hat{\varvec{f}} ) - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} - \frac{1}{2} \ln | \varvec{B} | \Bigg ) \nonumber \\&\,\,\,\overset{ }{ = } \underbrace{ \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \theta _j } }_{ \text {explicit} } \underbrace{ + \sum _{i=1}^n \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \hat{f}_i } \frac{ \partial \hat{f}_i }{ \partial \theta _j }, }_{ \text {implicit} } \end{aligned}$$
(28)

which consists of an explicit and an implicit term. The implicit term arises because \(\hat{\varvec{f}}\) and \(\varvec{W}\) depend on \(\varvec{K}\) and therefore on the hyperparameters. The first part of the explicit term

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \ln p( \varvec{y} | \hat{\varvec{f}} )&= 0 \end{aligned}$$

is equal to zero. For the second term we use the intermediate result \(\varvec{a}\) from Eq. (14) to obtain

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \Bigg ( - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \Bigg )&= \frac{1}{2} \underbrace{ \hat{\varvec{f}}^\top \varvec{K}^{-1} }_{ = \varvec{a}^\top } \frac{ \partial \varvec{K} }{ \partial \theta _j } \underbrace{ \varvec{K}^{-1} \hat{\varvec{f}} }_{ = \varvec{a} } \overset{ (14) }{ = } \frac{1}{2} \varvec{a}^\top \frac{ \partial \varvec{K} }{ \partial \theta _j } \varvec{a}, \end{aligned}$$

and for the third term we get

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \Bigg ( - \frac{1}{2} \ln | \varvec{B} | \Bigg )&\overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{B}^{-1} \frac{ \partial \varvec{B} }{ \partial \theta _j } \Bigg ) \overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{B}^{-1} \frac{ \partial }{ \partial \theta _j } \Big ( \varvec{I} + \varvec{W}^{\frac{1}{2}} \varvec{K} \varvec{W}^{\frac{1}{2}} \Big ) \Bigg ) \\&\overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{B}^{-1} \varvec{W}^{\frac{1}{2}} \frac{ \partial \varvec{K} }{ \partial \theta _j } \varvec{W}^{\frac{1}{2}} \Bigg ) = - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{W}^{\frac{1}{2}} \varvec{B}^{-1} \varvec{W}^{\frac{1}{2}} \frac{ \partial \varvec{K} }{ \partial \theta _j } \Bigg ) \\&\overset{ (26) }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( ( \varvec{W}^{-1} + \varvec{K} )^{-1} \frac{ \partial \varvec{K} }{ \partial \theta _j } \Bigg ) \overset{ (26) }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{R} \frac{ \partial \varvec{K} }{ \partial \theta _j } \Bigg ) \end{aligned}$$

by using the definitions of the matrices \(\varvec{B}\) and \(\varvec{R}\) and the fact that a cyclic permutation of matrix products does not change the trace. Therefore, the explicit part of the partial derivative is given by

$$\begin{aligned} \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \theta _j } \Bigg |_{\text {explicit}}&= \frac{1}{2} \varvec{a}^\top \frac{ \partial \varvec{K} }{ \partial \theta _j } \varvec{a} - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{R} \frac{ \partial \varvec{K} }{ \partial \theta _j } \Bigg ). \end{aligned}$$
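
In code, this explicit part is a one-liner. The sketch below assumes that \(\varvec{a} = \varvec{K}^{-1} \hat{\varvec{f}}\) from Eq. (14), \(\varvec{R}\) from Eq. (26), and the element-wise derivative of the kernel matrix are already available; the function and argument names are ours.

```python
import numpy as np

def explicit_kernel_grad(a, R, dK_dtheta):
    """Explicit part of d ln q(y|X) / d theta_j for a kernel hyperparameter.

    a:         K^{-1} f_hat, see Eq. (14)
    R:         (W^{-1} + K)^{-1}, see Eq. (26)
    dK_dtheta: element-wise derivative of the kernel matrix w.r.t. theta_j
    """
    return 0.5 * a @ dK_dtheta @ a - 0.5 * np.trace(R @ dK_dtheta)
```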

We now take care of the implicit part of the partial derivative. The derivative of the first two terms with respect to \(\hat{\varvec{f}}\) coincides with the derivative of \(\varPsi ( \hat{\varvec{f}} )\), which vanishes at \(\hat{\varvec{f}}\) since \(\hat{\varvec{f}}\) is the mode,

$$\begin{aligned} \frac{ \partial }{ \partial \hat{\varvec{f}} } \Bigg ( \ln p( \varvec{y} | \hat{\varvec{f}} ) - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \Bigg )&\equiv \frac{ \partial }{ \partial \hat{\varvec{f}} } \varPsi ( \hat{\varvec{f}} ) = 0. \end{aligned}$$

The third term of the partial derivative involves the derivative of the log determinant of \(\varvec{B}\) with respect to \(\hat{f}_i\). Using \(| \varvec{B} | = | \varvec{I} + \varvec{K} \varvec{W} |\) as well as the definition of the matrix \(\varvec{J}\) yields

$$\begin{aligned}&\frac{ \partial }{ \partial \hat{f}_i } \Bigg ( - \frac{1}{2} \ln | \varvec{B} | \Bigg )\\&\quad \overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{B}^{-1} \frac{ \partial \varvec{B} }{ \partial \hat{f}_i } \Bigg ) \overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big )^{-1} \frac{ \partial }{ \partial \hat{f}_i } \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big ) \Bigg ) \\&\quad \overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big )^{-1} \varvec{K} \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg ) = - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K} ( \varvec{K}^{-1} + \varvec{W} ) \Big )^{-1} \varvec{K} \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg ) \\&\quad \overset{ }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K}^{-1} + \varvec{W} \Big )^{-1} \underbrace{ \varvec{K}^{-1} \varvec{K} }_{ = \varvec{I} } \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg ) = - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K}^{-1} + \varvec{W} \Big )^{-1} \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg ) \\&\,\,\, \overset{ (27) }{ = } - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K} - \varvec{J}^\top \varvec{J} \Big ) \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg ) = - \frac{1}{2} {\text {diag}}\Big ( {\text {diag}}( \varvec{K} ) - {\text {diag}}( \varvec{J}^\top \varvec{J} ) \Big ) \cdot \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } . \end{aligned}$$
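
Since \(\varvec{W}\) is diagonal, \(\partial \varvec{W} / \partial \hat{f}_i\) has a single non-zero entry, so the trace above reduces to a vector whose \(i\)-th component is \(-\frac{1}{2} ( \varvec{K} - \varvec{J}^\top \varvec{J} )_{ii} \, \partial W_{ii} / \partial \hat{f}_i\). A sketch with names of our choosing follows; dW_df collects the derivatives \(\partial W_{ii} / \partial \hat{f}_i\), i.e., the negative third derivatives of the log likelihood.

```python
import numpy as np

def implicit_weight_vector(K, J, dW_df):
    """Vector with i-th entry d(-0.5 ln|B|) / d f_hat_i.

    K:     kernel matrix
    J:     L^{-1} W^{1/2} K, see Eq. (27)
    dW_df: length-n vector with entries dW_ii / d f_hat_i
    """
    return -0.5 * (np.diag(K) - np.sum(J * J, axis=0)) * dW_df
```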

We still need to take care of \(\frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }\). By using the definition of \(\hat{\varvec{f}}\) from Eq. (11) as well as the multidimensional chain rule, we obtain

$$\begin{aligned} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= \frac{ \partial \varvec{K} \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } + \frac{ \partial \varvec{K} \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \hat{\varvec{f}} } \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } \\&= \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) + \varvec{K} \underbrace{ \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \hat{\varvec{f}} } }_{ = -\varvec{W} } \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } \\&= \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) - \varvec{K} \varvec{W} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } \end{aligned}$$

which is equivalent to

$$\begin{aligned} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } + \varvec{K} \varvec{W} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= ( \varvec{I} + \varvec{K} \varvec{W} ) \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } = \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) \\ \iff \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= ( \varvec{I} + \varvec{K} \varvec{W} )^{-1} \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) \\&= \Big ( \varvec{I} - \varvec{K} \underbrace{ ( \varvec{W}^{-1} + \varvec{K} )^{-1} }_{ = \varvec{R} } \Big ) \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) \\&= \Big ( \varvec{I} - \varvec{K} \varvec{R} \Big ) \underbrace{ \frac{ \partial \varvec{K} }{ \partial \theta _j } \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }_{ = \varvec{b} } = \varvec{b} - \varvec{K} \varvec{R} \varvec{b} . \end{aligned}$$
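
A corresponding sketch for this change of the mode, assuming dlp holds the gradient \(\nabla \ln p( \varvec{y} | \hat{\varvec{f}} )\); again, all names are ours.

```python
import numpy as np

def dfhat_dtheta_kernel(K, R, dK_dtheta, dlp):
    """Implicit change of the mode f_hat w.r.t. a kernel hyperparameter theta_j."""
    b = dK_dtheta @ dlp        # b = (dK / d theta_j) grad ln p(y | f_hat)
    return b - K @ (R @ b)     # (I - K R) b
```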

Finally, the partial derivative of the approximate log marginal likelihood with respect to the kernel hyperparameters is given by

$$\begin{aligned} \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \theta _j }&= \frac{1}{2} \varvec{a}^\top \frac{ \partial \varvec{K} }{ \partial \theta _j } \varvec{a} - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{R} \frac{ \partial \varvec{K} }{ \partial \theta _j } \Bigg ) \\&\quad + \Bigg ( - \frac{1}{2} {\text {diag}}\Big ( {\text {diag}}( \varvec{K} ) - {\text {diag}}( \varvec{J}^\top \varvec{J} ) \Big ) \cdot \frac{ \partial \varvec{W} }{ \partial \hat{f}_i } \Bigg )^\top \Big ( \varvec{b} - \varvec{K} \varvec{R} \varvec{b} \Big ). \end{aligned}$$
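
Putting the pieces together, a sketch of the full derivative with respect to a kernel hyperparameter could look as follows; it simply combines the explicit part, the implicit weight vector, and the change of the mode from the sketches above.

```python
import numpy as np

def kernel_hyperparameter_grad(a, K, R, J, dK_dtheta, dlp, dW_df):
    """d ln q(y|X) / d theta_j for a kernel hyperparameter (explicit + implicit part)."""
    explicit = 0.5 * a @ dK_dtheta @ a - 0.5 * np.trace(R @ dK_dtheta)
    s2 = -0.5 * (np.diag(K) - np.sum(J * J, axis=0)) * dW_df   # d(-0.5 ln|B|) / d f_hat
    b = dK_dtheta @ dlp
    df = b - K @ (R @ b)                                       # d f_hat / d theta_j
    return explicit + s2 @ df
```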

Partial derivatives with respect to the likelihood hyperparameters

We now consider the partial derivatives with respect to the likelihood hyperparameters. As in Eq. (28), the derivative splits into an explicit and an implicit term. For the explicit part, we utilize the factorization of the likelihood \(p( \varvec{y} | \hat{\varvec{f}} )\) to obtain

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \ln p( \varvec{y} | \hat{\varvec{f}} ) = \frac{ \partial }{ \partial \theta _j } \ln \prod _{i=1}^n p( y_i | \hat{f}_i ) = \frac{ \partial }{ \partial \theta _j } \sum _{i=1}^n \ln p( y_i | \hat{f}_i ) = \sum _{i=1}^n \frac{ \partial }{ \partial \theta _j } \ln p( y_i | \hat{f}_i ). \end{aligned}$$

The second term of the explicit part is equal to zero, since neither \(\hat{\varvec{f}}\) nor \(\varvec{K}\) depends explicitly on a likelihood hyperparameter,

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \Bigg ( - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \Bigg )&= 0. \end{aligned}$$

For the third term, we have

$$\begin{aligned} \frac{ \partial }{ \partial \theta _j } \Bigg ( - \frac{1}{2} \ln | \varvec{B} | \Bigg )&\overset{ }{=} - \frac{1}{2} {\text {tr}}\Bigg ( \varvec{B}^{-1} \frac{ \partial \varvec{B} }{ \partial \theta _j } \Bigg ) \overset{ }{=} - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big )^{-1} \frac{ \partial }{ \partial \theta _j } \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big ) \Bigg ) \\&\overset{ }{=} - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{I} + \varvec{K} \varvec{W} \Big )^{-1} \varvec{K} \frac{ \partial \varvec{W} }{ \partial \theta _j } \Bigg ) \\&\overset{ }{=} - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K} ( \varvec{K}^{-1} + \varvec{W} ) \Big )^{-1} \varvec{K} \frac{ \partial \varvec{W} }{ \partial \theta _j } \Bigg ) \\&\overset{ }{=} - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K}^{-1} + \varvec{W} \Big )^{-1} \underbrace{ \varvec{K}^{-1} \varvec{K} }_{ = \varvec{I} } \frac{ \partial \varvec{W} }{ \partial \theta _j } \Bigg ) \\&\overset{(27)}{=} - \frac{1}{2} {\text {tr}}\Bigg ( \Big ( \varvec{K} - \varvec{J}^\top \varvec{J} \Big ) \frac{ \partial \varvec{W} }{ \partial \theta _j } \Bigg ) \\&= - \frac{1}{2} {\text {diag}}\Big ( {\text {diag}}( \varvec{K} ) - {\text {diag}}( \varvec{J}^\top \varvec{J} ) \Big ) \cdot \frac{ \partial \varvec{W} }{ \partial \theta _j }, \end{aligned}$$

which yields the final expression for the explicit part, given by

$$\begin{aligned} \frac{ \partial \ln q( \varvec{y} | \varvec{X} ) }{ \partial \theta _j } \Bigg |_{\text {explicit}}&= \sum _{i=1}^n \frac{ \partial }{ \partial \theta _j } \ln p( y_i | \hat{f}_i ) - \frac{1}{2} {\text {diag}}\Big ( {\text {diag}}( \varvec{K} ) - {\text {diag}}( \varvec{J}^\top \varvec{J} ) \Big ) \cdot \frac{ \partial \varvec{W} }{ \partial \theta _j } . \end{aligned}$$
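
Treating \(\partial \varvec{W} / \partial \theta _j\) as the vector of its diagonal entries, this explicit part for a likelihood hyperparameter can be sketched as follows; argument names are ours.

```python
import numpy as np

def explicit_likelihood_grad(dlp_dtheta, K, J, dW_dtheta):
    """Explicit part of d ln q(y|X) / d theta_j for a likelihood hyperparameter.

    dlp_dtheta: length-n vector with entries d ln p(y_i | f_hat_i) / d theta_j
    dW_dtheta:  length-n vector with the diagonal entries of dW / d theta_j
    """
    trace_term = np.sum((np.diag(K) - np.sum(J * J, axis=0)) * dW_dtheta)
    return np.sum(dlp_dtheta) - 0.5 * trace_term
```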

The implicit part closely mirrors the one for the kernel hyperparameters, with only minor modifications.

$$\begin{aligned} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= \frac{ \partial \varvec{K} \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } + \frac{ \partial \varvec{K} \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \hat{\varvec{f}} } \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } \\&= \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } + \varvec{K} \underbrace{ \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \hat{\varvec{f}} } }_{ = -\varvec{W} } \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } \\&= \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } - \varvec{K} \varvec{W} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }. \end{aligned}$$

This is equivalent to

$$\begin{aligned} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } + \varvec{K} \varvec{W} \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= ( \varvec{I} + \varvec{K} \varvec{W} ) \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j } = \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } \\ \iff \frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }&= ( \varvec{I} + \varvec{K} \varvec{W} )^{-1} \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } \\&= \Big ( \varvec{I} - \varvec{K} \underbrace{ ( \varvec{W}^{-1} + \varvec{K} )^{-1} }_{ = \varvec{R} } \Big ) \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } \\&= \Big ( \varvec{I} - \varvec{K} \varvec{R} \Big ) \underbrace{ \varvec{K} \frac{ \partial \nabla \ln p( \varvec{y} | \hat{\varvec{f}} ) }{ \partial \theta _j } }_{ = \varvec{d} } = \varvec{d} - \varvec{K} \varvec{R} \varvec{d}. \end{aligned}$$
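
A final sketch for this implicit change of the mode with respect to a likelihood hyperparameter; the total derivative is then obtained, exactly as in the kernel case, by adding the inner product of the implicit weight vector with this change of the mode to the explicit part. Names are ours.

```python
import numpy as np

def dfhat_dtheta_likelihood(K, R, ddlp_dtheta):
    """Implicit change of the mode f_hat w.r.t. a likelihood hyperparameter theta_j.

    ddlp_dtheta: length-n vector, d/d theta_j of grad ln p(y | f_hat)
    """
    d = K @ ddlp_dtheta        # d = K * d(grad ln p) / d theta_j
    return d - K @ (R @ d)     # (I - K R) d
```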

Cite this article

Mair, S., Brefeld, U. Distributed robust Gaussian Process regression. Knowl Inf Syst 55, 415–435 (2018). https://doi.org/10.1007/s10115-017-1084-7
