Abstract
We study distributed and robust Gaussian processes, where robustness is introduced by a Gaussian process prior on the function values combined with a Student-t likelihood. The posterior distribution is approximated by a Laplace approximation and, together with concepts from Bayesian committee machines, we efficiently distribute the computations and render robust GPs feasible on huge data sets. We provide a detailed derivation and report on empirical results. Our findings on real and artificial data show that our approach outperforms existing baselines in the presence of outliers while using all available data.
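To make the aggregation step concrete, here is a minimal sketch of the standard Bayesian committee machine rule for combining independent GP experts at a single test point; the function name and interface are illustrative rather than taken from the paper, and the robust variant discussed here would feed in each expert's Laplace-approximate predictive moments.

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """Combine M Gaussian expert predictions at one test point with the
    Bayesian committee machine rule (illustrative sketch).

    means, variances: shape-(M,) arrays of expert predictive means and
    variances; prior_var: the GP prior variance k(x*, x*) at the point.
    """
    M = len(means)
    # Sum of expert precisions, corrected for the (M - 1)-fold
    # over-counting of the prior precision.
    precision = np.sum(1.0 / variances) - (M - 1) / prior_var
    var = 1.0 / precision
    # Precision-weighted combination of the expert means.
    mean = var * np.sum(means / variances)
    return mean, var
```

Since each expert only touches its own shard of the data, the M expert predictions can be computed fully in parallel and combined in O(M) time per test point.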
Appendix
Before providing the derivation of the partial derivatives of the approximate log marginal likelihood in Eq. (21), we introduce the matrix \(\varvec{R}\), which will be convenient later on,
\[ \varvec{R} = \left( \varvec{W}^{-1} + \varvec{K} \right)^{-1}. \]
Using the matrix \(\varvec{R}\) as well as the matrix inversion lemma allows us to reformulate the inverse of \(\varvec{K}^{-1} + \varvec{W}\) as a sum of the kernel matrix \(\varvec{K}\) and a new matrix \(\varvec{J}\),
\[ \left( \varvec{K}^{-1} + \varvec{W} \right)^{-1} = \varvec{K} - \varvec{K} \varvec{R} \varvec{K} = \varvec{K} + \varvec{J} \quad \text{with} \quad \varvec{J} = -\varvec{K} \varvec{R} \varvec{K}. \]
Recall that there are kernel as well as likelihood hyperparameters. We focus on a squared exponential kernel with automatic relevance determination, parametrized by the signal standard deviation \(\sigma _f\) and the length scales \(\ell _i\) for all \(i=1,2,\ldots ,d\) dimensions. The likelihood is parametrized by the scale \(\sigma _t\) and the degrees of freedom \(\nu \).
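For concreteness, these two components take the following standard form; this is a sketch, and the paper's exact parametrization (e.g., whether \(\sigma _f\) or \(\sigma _f^2\) denotes the signal variance) may differ:
\[ k(\varvec{x}, \varvec{x}') = \sigma _f^2 \exp \left( -\frac{1}{2} \sum _{i=1}^{d} \frac{(x_i - x_i')^2}{\ell _i^2} \right), \qquad p(y \,|\, f) = \frac{\Gamma \left( \frac{\nu +1}{2} \right)}{\Gamma \left( \frac{\nu }{2} \right) \sqrt{\nu \pi }\, \sigma _t} \left( 1 + \frac{(y - f)^2}{\nu \sigma _t^2} \right)^{-\frac{\nu +1}{2}}. \]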
Partial derivatives with respect to the kernel hyperparameters
The partial derivatives with respect to the kernel hyperparameters are given by
\[ \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} = \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} \right| _{\text {explicit}} + \sum _{i=1}^{n} \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \hat{f}_i} \frac{\partial \hat{f}_i}{\partial \theta _j}, \]
which consists of an explicit and an implicit term. The implicit term is caused by the dependence of \(\hat{\varvec{f}}\) and \(\varvec{W}\) on \(\varvec{K}\), which in turn depends on the kernel hyperparameters. The first part of the explicit term, \(\partial \log p(\varvec{y} \,|\, \hat{\varvec{f}}) / \partial \theta _j\), is equal to zero, since the likelihood does not directly depend on the kernel hyperparameters. For the second term, we use the intermediate result \(\varvec{a}\) from Eq. (14) to obtain
\[ \frac{\partial }{\partial \theta _j} \left( -\frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \right) = \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{K}^{-1} \hat{\varvec{f}} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a}, \]
and for the third term we get
\[ \frac{\partial }{\partial \theta _j} \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right) \]
by using the definitions of the matrices \(\varvec{B}\) and \(\varvec{R}\) and the fact that a cyclic permutation of a matrix product leaves its trace unchanged. Therefore, the explicit part of the partial derivative is given by
\[ \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} \right| _{\text {explicit}} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a} - \frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right). \]
Now we take care of the implicit part of the partial derivative. Differentiating the first two terms with respect to \(\hat{\varvec{f}}\) is equivalent to differentiating \(\varPsi (\varvec{f})\), whose gradient vanishes at the mode \(\hat{\varvec{f}}\),
\[ \frac{\partial }{\partial \hat{\varvec{f}}} \left( \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \frac{1}{2} \hat{\varvec{f}}^\top \varvec{K}^{-1} \hat{\varvec{f}} \right) = \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \varvec{K}^{-1} \hat{\varvec{f}} = \varvec{0}. \]
The third term of the partial derivative is the derivative of the log determinant of \(\varvec{B}\). Using the definitions of the matrices \(\varvec{B}\) and \(\varvec{J}\) yields
\[ \frac{\partial }{\partial \hat{f}_i} \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \left[ \varvec{K} + \varvec{J} \right] _{ii} \frac{\partial \varvec{W}_{ii}}{\partial \hat{f}_i} = \frac{1}{2} \left[ \varvec{K} + \varvec{J} \right] _{ii} \frac{\partial ^3}{\partial \hat{f}_i^3} \log p(y_i \,|\, \hat{f}_i). \]
We still need to take care of \(\frac{ \partial \hat{\varvec{f}} }{ \partial \theta _j }\). By using the definition of \(\hat{\varvec{f}}\) from Eq. (11) as well as the multidimensional chain rule, we obtain
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _j} = \frac{\partial \varvec{K}}{\partial \theta _j} \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}) - \varvec{K} \varvec{W} \frac{\partial \hat{\varvec{f}}}{\partial \theta _j}, \]
which is equivalent to
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _j} = \left( \varvec{I} + \varvec{K} \varvec{W} \right)^{-1} \frac{\partial \varvec{K}}{\partial \theta _j} \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}}). \]
Finally, the partial derivative of the approximate log marginal likelihood with respect to the kernel hyperparameters is given by
\[ \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _j} = \frac{1}{2} \varvec{a}^\top \frac{\partial \varvec{K}}{\partial \theta _j} \varvec{a} - \frac{1}{2} \operatorname{tr} \left( \varvec{R} \frac{\partial \varvec{K}}{\partial \theta _j} \right) + \sum _{i=1}^{n} \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \hat{f}_i} \frac{\partial \hat{f}_i}{\partial \theta _j}. \]
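Putting the pieces together, the following is a minimal NumPy sketch of how the explicit and implicit parts combine for one kernel hyperparameter; all names are illustrative, and we assume \(\varvec{W}\) is invertible so that \(\varvec{R} = (\varvec{W}^{-1} + \varvec{K})^{-1}\) is well defined.

```python
import numpy as np

def kernel_grad(K, dK, f_hat, grad_ll, W, d3_ll):
    """One kernel-hyperparameter derivative of the approximate log
    marginal likelihood (illustrative sketch).

    K: kernel matrix; dK: its derivative w.r.t. theta_j;
    f_hat: posterior mode; grad_ll: d log p(y|f) / df at f_hat;
    W: -d^2 log p(y|f) / df^2 at f_hat (diagonal, as a vector);
    d3_ll: d^3 log p(y|f) / df^3 at f_hat.
    """
    n = len(f_hat)
    a = np.linalg.solve(K, f_hat)              # a = K^{-1} f_hat
    R = np.linalg.inv(np.diag(1.0 / W) + K)    # R = (W^{-1} + K)^{-1}
    KJ = K - K @ R @ K                         # K + J = (K^{-1} + W)^{-1}
    # Explicit part: 0.5 a' (dK) a - 0.5 tr(R dK).
    explicit = 0.5 * a @ dK @ a - 0.5 * np.trace(R @ dK)
    # Implicit part: d log q / d f_hat_i = 0.5 [K + J]_ii * d3_ll_i,
    # multiplied by d f_hat / d theta_j = (I + K W)^{-1} dK grad_ll.
    dq_df = 0.5 * np.diag(KJ) * d3_ll
    df_dtheta = np.linalg.solve(np.eye(n) + K @ np.diag(W), dK @ grad_ll)
    return explicit + dq_df @ df_dtheta
```

In practice one would factorize the relevant matrices once per hyperparameter sweep instead of calling `inv` and `solve` repeatedly; the sketch favors readability over efficiency.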
Partial derivatives with respect to the likelihood hyperparameters
We now consider the partial derivatives with respect to the likelihood hyperparameters \(\theta _\ell \in \{\sigma _t, \nu \}\). As in Eq. (28), the derivative splits up into an explicit and an implicit term. For the explicit part, we utilize the factorization of the likelihood \(p( \varvec{y} | \hat{\varvec{f}} )\) to obtain
\[ \frac{\partial \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell } = \sum _{i=1}^{n} \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \theta _\ell }. \]
The second term in the explicit part is equal to zero, since neither \(\hat{\varvec{f}}\) nor \(\varvec{K}\) directly depends on a likelihood hyperparameter.
For the third term, we have
\[ \frac{\partial }{\partial \theta _\ell } \left( -\frac{1}{2} \log |\varvec{B}| \right) = -\frac{1}{2} \operatorname{tr} \left( \left( \varvec{K} + \varvec{J} \right) \frac{\partial \varvec{W}}{\partial \theta _\ell } \right), \]
which yields the final expression for the explicit part, given by
\[ \left. \frac{\partial \log q(\varvec{y} \,|\, \varvec{X}, \varvec{\theta })}{\partial \theta _\ell } \right| _{\text {explicit}} = \sum _{i=1}^{n} \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \theta _\ell } - \frac{1}{2} \operatorname{tr} \left( \left( \varvec{K} + \varvec{J} \right) \frac{\partial \varvec{W}}{\partial \theta _\ell } \right). \]
The implicit part closely mirrors the implicit part for the kernel hyperparameters, the only modification being that the dependence on \(\theta _\ell \) enters through the likelihood rather than through the kernel matrix. Differentiating the definition of \(\hat{\varvec{f}}\) with respect to \(\theta _\ell \) gives
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell } = \varvec{K} \frac{\partial \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell } - \varvec{K} \varvec{W} \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell }. \]
This is equivalent to
\[ \frac{\partial \hat{\varvec{f}}}{\partial \theta _\ell } = \left( \varvec{I} + \varvec{K} \varvec{W} \right)^{-1} \varvec{K} \frac{\partial \nabla \log p(\varvec{y} \,|\, \hat{\varvec{f}})}{\partial \theta _\ell }. \]
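For reference, writing \(r_i = y_i - \hat{f}_i\), the Student-t quantities entering these expressions take the following standard form (a sketch of the usual derivatives; the paper's parametrization may differ in constants):
\[ \frac{\partial \log p(y_i \,|\, \hat{f}_i)}{\partial \hat{f}_i} = \frac{(\nu + 1)\, r_i}{\nu \sigma _t^2 + r_i^2}, \qquad \varvec{W}_{ii} = -\frac{\partial ^2 \log p(y_i \,|\, \hat{f}_i)}{\partial \hat{f}_i^2} = \frac{(\nu + 1)\left( \nu \sigma _t^2 - r_i^2 \right)}{\left( \nu \sigma _t^2 + r_i^2 \right)^2}. \]
Note that \(\varvec{W}_{ii}\) becomes negative for large residuals \(|r_i| > \sqrt{\nu }\, \sigma _t\), which is why formulations that avoid \(\varvec{W}^{1/2}\), such as the matrices \(\varvec{R}\) and \(\varvec{J}\) above, are convenient for the Student-t likelihood.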