A joint learning framework for Gaussian processes regression and graph learning
Introduction
Regression models based on Gaussian processes (GPs) are a powerful tool in various applications [1], [2]. Their objective is to reconstruct the underlying signals or functions that map inputs to outputs. Gaussian process regression (GPR) relies on the general assumption that similar inputs are likely to produce similar target values, which are assumed to follow a joint Gaussian distribution. In a GP, the similarity of target values is described by a covariance matrix, which in turn depends on a chosen kernel function. A wide range of kernel functions, such as the squared exponential (SE) kernel, the rational quadratic (RQ) kernel, periodic (PE) kernels [3], and the spectral mixture (SM) kernel [4], have been deployed in GP models. Among them, the SE kernel is the most popular; it describes the relationship between two target values using the Euclidean distance between their corresponding inputs. In practice, the prediction of target values is conducted by means of Bayesian inference.
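The SE kernel described above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy; the function name and default hyper-parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def se_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential (SE) kernel: covariance decays with the
    squared Euclidean distance between the corresponding inputs."""
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

# Covariance matrix of target values under a zero-mean GP prior.
X = np.random.default_rng(0).normal(size=(5, 2))
K = se_kernel(X, X)
```

The resulting matrix is symmetric and positive semidefinite, as required of any valid GP covariance.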
The performance and complexity of a GPR model are generally dominated by both the chosen kernel function and the volume of the data. The optimization of the hyper-parameters of kernel functions remains a challenging task, as it involves computing the inverse of the covariance matrix. Its computational cost is generally high, especially for complicated kernel functions [1]. To remedy this issue, inducing-point methods reduce the effective number of inputs from n training points to m inducing points, with m far smaller than n, and use these inducing points to construct an approximate covariance matrix. Since the rank of the effective covariance matrix is smaller than that of the original one, inducing-point methods can handle large volumes of data. Typical examples of this class include the subset of regressors (SoR) [5], the fully independent training conditional (FITC) [6], the partially independent training conditional (PITC) [1], and structured kernel interpolation (SKI) [7]. The idea of SoR is to replace the covariance matrix of the original inputs by a low-rank counterpart composed of the covariance matrix of the inducing points and the cross-covariance between the inducing points and the training data. It can also be viewed as approximating the original inputs by a linear transformation of the inducing points. Building on SoR, FITC and PITC were developed under different conditional-independence assumptions regarding the inducing points. Compared to SoR, their approximate covariance matrices are closer to the original one while requiring no extra computational burden. SKI is a scalable approach, also based on SoR. It places inducing points on a dense grid so that their covariance matrix has a special form, e.g., a Kronecker product [8] or a diagonal-constant (Toeplitz) structure [9], and it relaxes the restriction that input points must lie on a grid. Structured matrix algebra [10] has been employed to reduce the cost of computing the covariance matrix between inducing points and inputs, making SKI applicable to large-scale datasets.
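The SoR construction just described replaces the full n-by-n kernel matrix by a rank-m surrogate built from the inducing points. A minimal sketch, assuming NumPy; the choices of n, m, and the random inducing locations are illustrative only (SKI, by contrast, would place them on a grid).

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))   # n = 200 training inputs
Z = rng.normal(size=(15, 1))    # m = 15 inducing points, m << n

# SoR replaces K_nn by the low-rank surrogate K_nm K_mm^{-1} K_mn.
K_nm = se_kernel(X, Z)
K_mm = se_kernel(Z, Z) + 1e-8 * np.eye(len(Z))  # jitter for stability
K_sor = K_nm @ np.linalg.solve(K_mm, K_nm.T)
```

Because the surrogate has rank at most m, downstream linear solves can exploit the low-rank structure instead of inverting the full n-by-n matrix.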
In traditional GPR, the hyper-parameters of kernel functions are often learned by maximum likelihood estimation (MLE). However, the resulting training problem is generally nonconvex, even for simple kernel functions such as the SE kernel. Unsuitable initial values of the hyper-parameters can lead to local optima far from the global solution, damaging the prediction accuracy of the resulting GPR model. Another, more important issue with traditional GPR is that kernel functions are generally determined by pairwise distances or correlations between sample inputs. Higher-order statistical properties and the global topology of the whole set of inputs are not fully exploited in the current framework, which essentially undermines its modeling capability. Recently, some researchers have introduced additional a priori information to guide the estimation of the covariance matrix. For instance, an approximate precision matrix of the target values is learned jointly with the covariance matrix in Miao et al. [11]. Since the problem of estimating the precision matrix is convex, its optimal solution can be reliably attained and used in the training of the covariance matrix.
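The MLE step amounts to minimizing the negative log marginal likelihood over the kernel hyper-parameters, and the nonconvexity mentioned above means the result depends on the initial values. A minimal sketch under standard assumptions (SE kernel, Gaussian noise), using NumPy and SciPy; the parametrization and data are illustrative, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X, ell, sf2):
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(X**2, axis=1)[None, :] - 2.0 * X @ X.T)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def neg_log_marglik(log_theta, X, y):
    # Log-parametrization keeps (ell, sf2, sn2) positive.
    ell, sf2, sn2 = np.exp(log_theta)
    Ky = se_kernel(X, ell, sf2) + sn2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi))

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

# Nonconvex objective: different initial values may end in different local optima.
res = minimize(neg_log_marglik, x0=np.zeros(3), args=(X, y))
```

Restarting the optimizer from several initial values and keeping the best objective is a common workaround for the local-optimum issue.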
Graphs play an important role in many tasks of signal processing and machine learning, since they can describe both regular and irregular data. A graph is composed of vertices (or nodes) and edges. Vertices represent various entities, while edges denote not only concrete but also abstract relationships among vertices. Given sample data, various algorithms have been developed to estimate their topological structure [12]. For instance, a framework has been proposed in Egilmez et al. [13] to estimate graph Laplacians from observed data under structural constraints. It has been shown that the graph Laplacian can be treated as the precision matrix in maximum a posteriori parameter estimation of Gaussian–Markov random field models. This offers the potential to jointly learn the Laplacian matrix of a graph and the covariance matrix of a GP. However, the computational complexity of the graph learning approaches developed in Egilmez et al. [13] can be high, since they can only handle the Laplacian matrix as a whole.
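The Laplacian-as-precision connection noted above can be made concrete in a few lines. A minimal sketch, assuming NumPy; the 3-vertex graph and the regularization constant are illustrative only (the Laplacian itself is singular, so a small diagonal term is added before inversion).

```python
import numpy as np

# Weighted adjacency of a 3-vertex graph (symmetric, zero diagonal).
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 2.0],
              [0.5, 2.0, 0.0]])
D = np.diag(W.sum(axis=1))
L = D - W                       # combinatorial graph Laplacian

# L is positive semidefinite with the constant vector in its null space,
# so a regularized Laplacian can serve as a GMRF precision matrix.
Theta = L + 1e-3 * np.eye(3)    # precision matrix
Sigma = np.linalg.inv(Theta)    # implied covariance of the GMRF
```

Zero entries of the precision matrix correspond to missing edges, i.e., conditional independence between the associated vertices.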
The topological structure of the underlying graph, learned from observed data or specified by a priori knowledge, can also be exploited to improve the performance of traditional GPR. In [14], a graph Gaussian process (GGP) model was developed for classification tasks. The outputs of a GP corresponding to different inputs are locally averaged over their neighborhoods in a given relational graph to obtain latent variables, which are then employed to predict the class of the inputs. In [15], [16], the authors considered the scenario where each sample input corresponds to a vector output assumed to lie on a graph. However, many regression tasks require predicting only a scalar target value. In this situation, the topological structure of a graph learned from scalar target values becomes unreliable. On the other hand, since each sample input in traditional GPR is generally multi-dimensional, the topological structure of the sample inputs is more informative.
In this paper, we propose a novel GPR model in which multi-dimensional sample inputs are viewed as signals generated over a weighted graph. A joint learning framework is developed to simultaneously estimate the covariance matrix of the target values and the underlying graph of the sample inputs. In this way, more topological information about the inputs can be exploited to estimate the covariance matrix of the target values and thus improve the prediction accuracy of the resulting GP model. In addition, we develop novel numerical approaches for the proposed models, which provide more reliable and efficient graph estimation than the state of the art. The paper is organized as follows. In Section 2, we review traditional GPR and some related work. In Section 3, we introduce the fundamentals of graphs and the proposed joint learning framework, and develop alternating optimization algorithms to tackle the resulting problems. Experimental results obtained on three sets of real data are presented in Section 4. Finally, Section 5 concludes the paper.
Traditional GPR
Let X = {x_1, …, x_n} denote the training inputs and y = [y_1, …, y_n]^T consist of the target values, each y_i associated with x_i. For a regression problem, each y_i is modeled as the output of an unknown function f corrupted by additive noise, that is, y_i = f(x_i) + ε_i, where ε_i is supposed to follow a zero-mean isotropic Gaussian distribution, i.e., ε_i ~ N(0, σ²). Then, the probability density function (PDF) of y is a zero-mean multivariate Gaussian whose covariance C denotes the covariance matrix of target values
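The Bayesian prediction step that this model leads to (the standard GP posterior mean and variance at a test input) can be sketched as follows, assuming NumPy; the SE kernel, data, and noise level are illustrative, not the paper's experimental setup.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
y = np.sin(X[:, 0])
Xs = np.array([[0.0]])          # test input
sn2 = 1e-2                      # noise variance sigma^2

Ky = se_kernel(X, X) + sn2 * np.eye(30)       # covariance of observed targets
ks = se_kernel(X, Xs)                          # cross-covariance with test input
mean = ks.T @ np.linalg.solve(Ky, y)           # posterior predictive mean
var = se_kernel(Xs, Xs) - ks.T @ np.linalg.solve(Ky, ks)  # posterior variance
```

The predictive variance is the prior variance reduced by what the training data explain, so it shrinks near observed inputs.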
Joint learning framework
Modern data analysis and processing typically involve a large volume of structured data, where the structure carries critical information about the essence of the data [17]. Graphs offer a useful way to describe relationships in complex datasets. They have become a powerful mathematical model and a practical tool of modern data analysis [18], [19], [20], [21]. A weighted graph G = (V, E) consists of a finite vertex set V and an edge set E ⊆ V × V [22]. Every edge is often associated with a
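The view of multi-dimensional inputs as signals over a weighted graph can be quantified by the Dirichlet energy, a standard smoothness measure in graph signal processing. A minimal sketch, assuming NumPy; the random graph and input matrix are illustrative, and this is one common smoothness criterion rather than the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 3
X = rng.normal(size=(n, d))     # row x_i is the input signal at vertex i

# A symmetric, nonnegative weight matrix with zero diagonal.
W = np.abs(rng.normal(size=(n, n)))
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W  # combinatorial Laplacian

# Dirichlet energy: trace(X^T L X) = 0.5 * sum_ij W_ij ||x_i - x_j||^2,
# small when strongly connected vertices carry similar inputs.
energy = np.trace(X.T @ L @ X)
check = 0.5 * sum(W[i, j] * np.sum((X[i] - X[j])**2)
                  for i in range(n) for j in range(n))
```

Graph-learning formulations typically minimize such an energy over the edge weights, which encourages large weights between vertices whose inputs are similar.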
Experimental setup
Three real-world datasets are used to evaluate the prediction performance of the proposed GPR algorithms. The normalized mean square error (NMSE), defined below, is adopted to measure prediction accuracy [3]. In our experiments, three representative kernel functions are employed in (3):
- a)
Isotropic SE kernel (SEiso in short) or radial basis function (RBF) kernel: k(x_i, x_j) = σ_f² exp(−‖x_i − x_j‖² / (2ℓ²)). Hyper-parameters include the signal variance σ_f² and the length scale ℓ in the above kernel function, and standard
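The NMSE definition is not reproduced in this excerpt, so the following sketch uses one common convention (squared prediction error normalized by the total squared deviation of the true targets); the exact normalization in the paper may differ. Assumes NumPy.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean square error (one common convention):
    squared error divided by the sum of squared deviations of the
    true targets from their mean."""
    return (np.sum((y_pred - y_true)**2)
            / np.sum((y_true - y_true.mean())**2))

y = np.array([1.0, 2.0, 3.0])
```

Under this convention, a perfect predictor scores 0 and predicting the constant mean of the targets scores 1, so values below 1 indicate the model beats the trivial baseline.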
Conclusions
In this paper, we have proposed a novel joint learning framework that combines GPR with graph learning, so that topological information can be effectively exploited to improve the prediction accuracies of the resulting GP models. Two strategies have been developed for constructing graphs of sample inputs. The resulting problems can be tackled by an alternating optimization scheme. Theoretical analyses regarding optimal solutions to the underlying graph learning problems have also been presented to
CRediT authorship contribution statement
Xiaoyu Miao: Methodology. Aimin Jiang: Formal analysis. Yanping Zhu: Software. Hon Keung Kwan: Writing – original draft.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0100800 and the National Natural Science Foundation of China under Grant 61801055.
References (31)
- et al., Estimating the Laplacian matrix of Gaussian mixtures for signal processing on graphs, Signal Process. (2018)
- et al., A unifying view of sparse approximate Gaussian process regression, J. Mach. Learn. Res. (2005)
- et al., Gaussian Processes for Machine Learning (2006)
- et al., Gaussian process kernels for pattern discovery and extrapolation, International Conference on Machine Learning (2013)
- Some aspects of the spline smoothing approach to non-parametric regression curve fitting, J. R. Stat. Soc. (1985)
- et al., Sparse Gaussian processes using pseudo-inputs, Adv. Neural Inf. Process. Syst. (2006)
- et al., Kernel interpolation for scalable structured Gaussian processes (KISS-GP), International Conference on Machine Learning (2015)
- Scalable Inference for Structured Gaussian Process Models (2012)
- et al., Fast kernel learning for multidimensional pattern extrapolation, NIPS (2014)
- et al., Faster kernel interpolation for Gaussian processes, International Conference on Artificial Intelligence and Statistics (2021)
- Gaussian processes regression with joint learning of precision matrix, 2020 28th European Signal Processing Conference (EUSIPCO)
- Learning graphs from data: a signal representation perspective, IEEE Signal Process. Mag.
- Graph learning from data under Laplacian and structural constraints, IEEE J. Sel. Top. Signal Process.
- Predicting graph signals using kernel regression where the input signal is agnostic to a graph, IEEE Trans. Signal Inf. Process. Netw.