1 Introduction

Kolmogorov partial differential equations (PDEs) have found wide application in physics [32], chemistry [46], finance [3], and the natural sciences [7]. Various numerical methods, such as finite difference [4, 21, 22, 52] and finite element methods [11, 53], have been proposed and are widely adopted by practitioners. While these numerical methods are undoubtedly of high value in solving various engineering problems, two challenges must be overcome to address the more complex problems that arise in real-world scenarios:

  1.

    Evaluating the numerical validity. When performing numerical computations, it is desirable to estimate the error between the numerical results and the true solution. Although existing numerical methods can address this through error analysis techniques, they typically require careful mathematical analysis, which is different for each PDE of interest. Ideally, one would obtain numerical information about the validity of the solution simultaneously with the computed solution.

  2.

    Constraining the solution domain. In mesh-based methods, one typically partitions the domain on which the PDE is defined into a mesh, and then solves a globally defined linear equation to obtain the solution. Since the computational complexity dramatically increases with the degrees of freedom in meshes, it becomes difficult to solve problems where the degrees of freedom inevitably become large, such as those defined on unbounded domains or those in high-dimensional settings. Under such circumstances, it is desirable to restrict the domain of interest to a low-dimensional bounded domain so that the solution can be obtained efficiently.

The first challenge has long been the focus of error analysis in conventional numerical methods, such as finite difference and finite element methods [43, 47]. In particular, in the finite element method, a priori or a posteriori error estimation plays a crucial role, which is performed before and after the numerical solution is obtained, respectively [5, 23]. A priori estimation primarily involves truncation error analysis [6], while a posteriori estimation often involves flux recovery methods, residual methods, and duality-based approaches [44]. In general, tight error analysis requires careful mathematical evaluation, which depends on the form of the PDEs. It is therefore desirable to evaluate the error numerically, regardless of the form of the PDEs. Furthermore, the reliance on meshes in these conventional methods poses computational challenges when the degrees of freedom in meshes become large, such as in problems defined on unbounded domains or in problems with high-dimensional spaces.

For the second challenge, Monte Carlo-based mesh-free calculations are actively investigated. In particular, methods based on the Feynman–Kac (FK) formula have attracted attention as a promising strategy for constraining the solution domain [8, 9, 16, 17, 26, 34, 36]. This approach selects a point in space-time at which the solution is sought, simulates multiple realizations of a stochastic differential equation (SDE) starting from that point, and approximates the PDE solution at the selected point by the ensemble average of the SDE solutions. Since the FK formula provides the solution only at a single point in space-time, regression over multiple sampled points is required to obtain the solution over the domain of interest. Regression via linear interpolation and neural networks [2, 9, 38, 45] has been proposed. Nonetheless, these approaches give only point estimates of the PDE solutions and cannot guarantee the accuracy of the estimated results.

This study proposes a new numerical method to solve the above two challenges simultaneously. In the proposed method, Gaussian Process Regression (GPR) is used to obtain solutions to PDEs over a domain of interest while evaluating their uncertainty. Specifically, GPR smoothly interpolates solutions at discrete points obtained by a Monte Carlo method based on the FK formula.

Here, GPR is compatible with Monte Carlo methods from two perspectives that are unattainable with other regression techniques. Firstly, GPR can incorporate a priori knowledge about the regularity of PDE solutions. In particular, designing a kernel function allows for determining the differentiability of the regression results. Secondly, using the variance information of noise contained in the Monte Carlo samples increases the accuracy of the regression. The variance is easily estimated by the sample variance of the Monte Carlo samples.

To address the first challenge, the proposed method can assess the uncertainty of the regression through the posterior variance in the GPR. Under appropriate assumptions, the posterior variance allows a posteriori probabilistic evaluation of the error between the numerical solution and the true solution. In addition, the use of GPR also provides a priori estimation of the posterior variance. We establish a theoretical lower bound on the posterior variance, which is crucial for estimating the best-case performance of numerical solutions. In standard numerical error analysis, the estimated error is given as an upper bound. In contrast, our lower bound is an optimistic estimate of the error, yet it helps to determine the minimum sample size required to obtain a desired posterior variance.

To address the second challenge, the proposed method can choose an arbitrary low-dimensional bounded domain on which to compute the solution. This is achieved by restricting the Monte Carlo sampling points to the low-dimensional bounded domain and performing the GPR only on these points.

To demonstrate its effectiveness, we perform numerical computations on 10-dimensional PDEs. Instead of covering the entire 10-dimensional domain, we apply GPR to a bounded domain of one to three dimensions. Although the computational complexity of GPR scales cubically with the number of observation points, making full-dimensional solutions impractical, this strategy allows efficient computation on low-dimensional bounded domains. Numerical experiments show that the proposed method is more accurate and more robust than an existing technique.

The remainder of the paper is structured as follows: In Sect. 2, we define the class of the PDEs to be addressed. In Sect. 3, we propose a numerical method that combines the GPR and the FK formula. Theoretical results for the estimation of the posterior variance in the proposed method are presented in Sect. 4. In Sect. 5, we numerically apply the proposed method to three types of 10-dimensional PDEs and show that more accurate and less uncertain solutions are obtained than with existing methods. Section 6 provides a brief summary and suggests possible future work.

2 Problem Formulation

Throughout this paper, we consider the following Cauchy problem for the Kolmogorov PDEs:

$$\begin{aligned}&\begin{aligned} \partial _t v(t,x)&+\frac{1}{2} \operatorname {tr}\left\{ a(t, x) a(t, x)^{\top } \partial _{xx} v(t,x)\right\} + b(t, x)^\top \partial _x v(t, x) \\&\quad +c(t, x) v(t,x) + h(t, x)=0,\quad \quad \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned}&v(T,x)=g(x), \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(2)

where \(T>0\) denotes the terminal time, \(t\in [0,T]\) denotes the time, and \(v:[0,T]\times {\mathbb {R}}^d\rightarrow {\mathbb {R}}\) denotes the unknown function of the PDE. The function \(b:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^d\) denotes the advection coefficient, \(a:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^{d \times m}\) denotes the diffusion coefficient, \(c:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) denotes the reaction coefficient, \(h: [0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) denotes the generative coefficient, and \(g:\mathbb {R}^d \rightarrow \mathbb {R}\) denotes the terminal condition. We assume that each function satisfies the following conditions.

Assumption 1

The functions \(b:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^d\), \(a:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^{d \times m}\), \(c:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\), and \(h: [0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) are uniformly continuous. The function c is bounded. There exists a constant \(L>0\) such that the following holds for every \(\varphi \in \{ b, a, c, h\}\):

$$\begin{aligned} \begin{aligned} |\varphi (t, x)-\varphi (t, x^{\prime })|&\le L|x-x^{\prime }|, \quad \forall t \in [0, T], x, x^{\prime } \in \mathbb {R}^d, \\ |\varphi (t, 0)|&\le L, \qquad \qquad \forall t \in [0, T]. \end{aligned} \end{aligned}$$
(3)

Proposition 1

(Ref. [51]) Suppose that Assumption 1 holds. Then, (1) and (2) have a unique viscosity solution.

Finding a solution to (1) and (2) in closed form is possible only in very limited cases, and thus numerical calculations are usually required. The goal of this study is to find a numerical solution v of (1) and (2). More specifically, given the known functions a, b, c, h, and g, we aim to find the solution \(v(0,x)\ (x\in {\mathcal {X}})\) at the initial time \(t=0\), where the set \({\mathcal {X}}\subseteq {\mathbb {R}}^d\) is the domain of interest on which we want to know the solution. In addition, we also aim to assess the uncertainty of the solution, that is, how much the numerical results deviate from the true solution.

Remark 1

The terminal value problem (1) and (2) is transformed into the initial value problem by introducing the variable transformation \(w(t,x){:=}v(T-t,x)\):

$$\begin{aligned}&\begin{aligned} -\partial _s w(s,x)&+\frac{1}{2} \operatorname {tr}\left\{ a(s, x) a(s, x)^{\top } \partial _{xx} w(s, x)\right\} + b(s, x)^\top \partial _x w(s,x) \\&\quad +c(s, x) w(s, x) + h(s, x)=0, \quad \quad \ (s, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned}&w(0,x)=g(x), \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(5)

Thus, for initial value problems, this variable transformation should be performed before applying the proposed method described below.

3 Proposed Algorithm

In this section, we propose a method to find the solution \(v(0,x)\ (x\in {\mathcal {X}})\) of the PDE (1) and (2) at the initial time \(t=0\). We use GPR to quantify how plausible it is that each candidate function is the solution v of the PDE. Using prior knowledge and a data set, GPR expresses this plausibility as a distribution over functions. The prior knowledge about the solution v is represented as a probability distribution p(v). Once a data set U associated with v is observed, the posterior probability p(v|U) is obtained analytically by Bayesian estimation. In this sense, the proposed data-driven method estimates the posterior p(v|U) over candidate solutions instead of solving the PDE directly. In Sect. 3.1, we describe how to obtain the posterior distribution using GPR. In Sect. 3.2, we describe how to obtain the data set that approximates the solution of the PDE using the Feynman–Kac formula.

3.1 Gaussian Process Regression

We design the prior as a Gaussian process. For any \(X=[x_1,\ldots ,x_N]^\top \in {\mathcal {X}}^N\), we assume that the probability distribution \(p(v_X|X)\), where

$$\begin{aligned} v_{X}{:=}[v(0,x_1),\ldots ,v(0,x_N)]^\top \in {\mathbb {R}}^N, \end{aligned}$$
(6)

is a normal distribution:

$$\begin{aligned} p(v_X|X) = {\mathcal {N}}(0, k_{XX}). \end{aligned}$$
(7)

Here, the matrix \(k_{XX}\in {\mathbb {R}}^{N\times N}\) is defined as \(\left[ k_{XX}\right] _{i j}=k\left( x_i, x_j\right) \) with a positive-definite kernel function \(k:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\), where \([k_{XX}]_{ij}\) denotes the (ij) component of \(k_{XX}\). See Appendix for a detailed definition of the Gaussian process. The method for designing the kernel function k is discussed in Sect. 3.2.

We assume that the data set is given as follows. For N observation points \(X=\left[ x_1, \ldots , x_{N}\right] ^\top \in \mathcal {X}^N\) and a sample size of M per point, we assume that the data

$$\begin{aligned} U=\left[ u_1(x_1),\ldots ,u_M(x_1), \ldots , u_1(x_{N}),\ldots , u_M(x_{N})\right] ^\top \in \mathbb {R}^{MN}, \end{aligned}$$
(8)

which approximates the solution of the PDE, are obtained. We assume that these data satisfy

$$\begin{aligned} u_j(x_i)= v(0,x_i) + \xi _{i,j}, \quad i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\}, \end{aligned}$$
(9)

where \(\xi _{i,j}\ (i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\})\) are random variables that are independent and follow a normal distribution with mean 0 and variance \(r(x_i)\):

$$\begin{aligned} \xi _{i,j} \sim \mathcal {N}\left( 0, r(x_i)\right) , \quad i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\}. \end{aligned}$$
(10)

The function \(r:{\mathcal {X}}\rightarrow {\mathbb {R}}_{> 0}\) is a noise variance function. To reduce the effect of noise, we derive the posterior distribution using the sample mean \({\overline{u}}(x_i)\) for the data at the same observation point \(x_i\):

$$\begin{aligned} {\overline{U}} = \left[ {\overline{u}}(x_1),\ldots ,{\overline{u}}(x_N)\right] ^\top \in {\mathbb {R}}^N, \end{aligned}$$
(11)

where \(\overline{{u}} (x_i){:=}\sum _{j=1}^M {u}_j (x_i)/M\). Since the variance of the sample mean is the variance of each sample divided by M, (9) is rewritten as follows:

$$\begin{aligned} {\overline{u}}(x_i) = v(0,x_i) + {\overline{\xi }}_{i}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(12)

where

$$\begin{aligned} {\overline{\xi }}_{i}\sim \mathcal {N}\left( 0, \frac{r(x_i)}{M}\right) , \quad i\in \{1, \ldots , N\}. \end{aligned}$$
(13)

In summary, the data set is modeled as satisfying

$$\begin{aligned} p({\overline{U}}|v_X,X) = {\mathcal {N}}\left( v_X, \frac{r_{XX}}{M}\right) , \end{aligned}$$
(14)

where \(r_{X X}\in \mathbb {R}^{N \times N}\) is a noise variance matrix of the form \(\left[ r_{X X}\right] _{i j}=\delta _{ij} r(x_i)\). The methods for obtaining the data set \({\overline{U}}\) and the noise variance matrix \(r_{XX}\) are described in Sect. 3.2.

In GPR, the posterior distribution can be analytically obtained from the prior and the data set, as stated in the next proposition. Recalling that k is the kernel function representing the prior in (7), we define \(k_{X x}=k_{x X}^{\top }=\left[ k\left( x_1, x\right) , \ldots , k\left( x_{N}, x\right) \right] ^{\top }\).

Proposition 2

(Ref. [14]) Suppose that the matrix \(k_{X X}+r_{XX}/M\) is invertible. Suppose also that (7) and (14) hold. For any \(x\in {\mathcal {X}}\), the posterior distribution is calculated as

$$\begin{aligned} p(v(0,x)| X,{\overline{U}},x) = {\mathcal {N}}(\widetilde{u}(x), \widetilde{k}(x,x)), \end{aligned}$$
(15)

where

$$\begin{aligned} \widetilde{u}(x)&=k_{x X}\left( k_{X X}+\frac{r_{X X}}{M}\right) ^{-1}{\overline{U}}, \quad x \in \mathcal {X}, \end{aligned}$$
(16)
$$\begin{aligned} \widetilde{k}\left( x, x^{\prime }\right)&=k\left( x, x^{\prime }\right) -k_{x X}\left( k_{X X}+\frac{r_{XX}}{M}\right) ^{-1} k_{X x^{\prime }}, \quad x, x^{\prime } \in \mathcal {X}. \end{aligned}$$
(17)

Here, we estimate the solution of the PDE by evaluating the posterior mean function (16). We use the posterior variance \({\widetilde{\sigma }}^2(x) {:=}\widetilde{k}(x,x)\) to assess the uncertainty of the estimates. This value equals the mean squared error (MSE) between the true solution v(0, x) and the posterior mean, where the expectation is taken with respect to the posterior \(p(v(0,x)| X,{\overline{U}},x)\).
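
To make Proposition 2 concrete, the following is a minimal NumPy sketch of the posterior mean (16) and posterior variance (17) with the heteroscedastic noise matrix \(r_{XX}/M\); the function and variable names are illustrative and are not part of any reference implementation.

```python
import numpy as np

def gp_posterior(k, X, u_bar, r_diag, M, x_query):
    """Posterior mean (16) and variance (17) under heteroscedastic noise r_XX / M.

    k       : callable kernel, k(x, x') -> float (positive definite)
    X       : (N, d) array of observation points
    u_bar   : (N,) array of sample means, cf. (11)
    r_diag  : (N,) array of noise variances r(x_i)
    M       : FK sample size per observation point
    x_query : (Q, d) array of evaluation points
    """
    N = X.shape[0]
    K_XX = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    A = K_XX + np.diag(r_diag) / M                     # k_XX + r_XX / M
    K_qX = np.array([[k(xq, X[j]) for j in range(N)] for xq in x_query])
    mean = K_qX @ np.linalg.solve(A, u_bar)            # eq. (16)
    cross = np.linalg.solve(A, K_qX.T).T               # A^{-1} k_{X x} for each query point
    var = np.array([k(xq, xq) for xq in x_query]) - np.einsum("qi,qi->q", K_qX, cross)
    return mean, var                                   # eq. (17) with x = x'
```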

Remark 2

In standard GPR, the random variable \(\xi _{i,j}\) has constant variance \(r(x)\equiv \sigma ^2\), which is a hyperparameter learned to maximize the likelihood. As described in Sect. 3.2, in our method the noise variance is estimated as a function over the space \({\mathcal {X}}\) by using the sampled data set. Although the concept of heteroscedastic GPR has already been studied for many years [24, 25, 29, 30], to the best of our knowledge, this is the first time that it has been used for the numerical calculation of PDEs.

3.2 Prior and Data Set

In the GPR introduced in the previous section, the following design parameters have not yet been determined:

  1.

    The kernel function k in (7)

  2.

    The data set U and the noise variance matrix \(r_{XX}\) that satisfy (14)

We first discuss the kernel function k. We propose the use of the following Matérn kernel to reflect a priori knowledge about the regularity of the PDE solution:

$$\begin{aligned} k\left( x, x^{\prime }\right) =\frac{1}{2^{\alpha -1} \varGamma (\alpha )}\left( \frac{\sqrt{2 \alpha }\left\| x-x^{\prime }\right\| }{h}\right) ^\alpha K_\alpha \left( \frac{\sqrt{2 \alpha }\left\| x-x^{\prime }\right\| }{h}\right) , \end{aligned}$$
(18)

where \(\alpha >0\) and \(h>0\) are hyperparameters, \(\varGamma \) is the Gamma function, and \(K_\alpha \) is the modified Bessel function of the second kind of order \(\alpha \).

This choice is justified as follows. By the Moore–Aronszajn theorem [1], every positive-definite kernel k is associated with a unique reproducing kernel Hilbert space (RKHS) \(\mathcal {H}_k\). The norm of this RKHS quantifies the smoothness of functions. In particular, the RKHS of the Matérn kernel is known to be norm-equivalent to a Sobolev space:

Proposition 3

(RKHSs of Matérn kernels: Sobolev spaces [19, 49])

Let k be the Matérn kernel on \(\mathcal {X} \subseteq \mathbb {R}^d\). Suppose that \(s{:=}\alpha +d / 2\) is an integer. Then, the RKHS \(\mathcal {H}_{k}\) is norm-equivalent to the Sobolev space of order s.

Therefore, in situations where information about the regularity of the PDE solutions is known, the smoothness of the regression results can be made to match the smoothness of the true solution by appropriately choosing the parameter \(\alpha \) in the Matérn kernel.
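
As an illustration, the Matérn kernel (18) can be evaluated with the Gamma and modified Bessel functions available in SciPy; the following is a minimal sketch with illustrative names.

```python
import numpy as np
from scipy.special import gamma, kv  # Gamma function and modified Bessel K_alpha

def matern_kernel(x, x_prime, alpha, h):
    """Matern kernel (18); alpha = s - d/2 gives an RKHS equivalent to the Sobolev space of order s."""
    r = np.linalg.norm(np.asarray(x, float) - np.asarray(x_prime, float))
    if r == 0.0:
        return 1.0                                  # limiting value at coinciding points
    z = np.sqrt(2.0 * alpha) * r / h
    return 2.0 ** (1.0 - alpha) / gamma(alpha) * z ** alpha * kv(alpha, z)
```

For example, to target a Sobolev space of order 2 on a \({\tilde{d}}\)-dimensional domain, one would choose \(\alpha = 2 - {\tilde{d}}/2\) (cf. Sect. 5).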

Next, to observe the data that satisfy (9), we employ the Feynman-Kac formula. Using the coefficients of (1) and (2), we define a random variable

$$\begin{aligned} \begin{aligned} {\widehat{u}}(t,x) {:=}\,&\int _t^T h(s, X(s ; t, x)) e^{-\int _t^s c(r, X(r ; t, x)) \textrm{d}r} \textrm{d}s \\&+g(X(T ; t, x)) e^{-\int _t^T c(r, X(r ; t, x)) \textrm{d}r}, \end{aligned} \end{aligned}$$
(19)

where \(X(\cdot ) \equiv X(\cdot ; t, x)\) is a strong solution of the following stochastic differential equation:

$$\begin{aligned} \begin{aligned} \textrm{d}X(s)&=b(s, X(s)) \textrm{d}s+a(s, X(s)) \textrm{d}W(s), \quad s \in [t, T], \\ X(t)&=x, \end{aligned} \end{aligned}$$
(20)

where \((t, x) \in [0, T) \times \mathbb {R}^d\) and \(W(\cdot )\) is a standard m-dimensional Wiener process starting at \(W(t)=0\) at time t.

As stated below, the Feynman–Kac formula guarantees that \({\widehat{u}}\) appropriately approximates the solution v of the PDE. For any \((t,x)\in [0,T]\times {\mathbb {R}}^d\), let \({\widehat{\xi }}(t,x){:=}\widehat{u}(t,x)-v(t,x)\):

$$\begin{aligned} \begin{aligned} {\widehat{u}}(t,x) = v(t, x) + {{\widehat{\xi }}}(t,x),\quad (t, x) \in [0, T] \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(21)

where \(\widehat{u}\) is defined as (19) and v satisfies (1) and (2).

Proposition 4

(Feynman–Kac formula for Kolmogorov PDE [51]) Suppose that Assumption 1 holds. Then, \({{\widehat{\xi }}}(t,x)\) satisfies

$$\begin{aligned} {\mathbb {E}}[{{\widehat{\xi }}}(t,x)|t,x,v] = 0. \end{aligned}$$
(22)

In the classical Feynman–Kac formula, a nondegeneracy condition on the diffusion term is required for the existence of classical solutions; however, since we consider viscosity solutions here, this condition is unnecessary.

In this study, numerical computation of (19) is called Feynman–Kac (FK) sampling. Note that in the implementation of FK sampling, the computation of (20) requires a numerical calculation of the SDE using a method such as the Euler–Maruyama method, and the computation of (19) requires a numerical integration using a technique such as the trapezoidal rule.
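
As a rough illustration, a single FK sample (19) at a point (t, x) might be computed as in the following sketch, which combines an Euler–Maruyama discretization of (20) with trapezoidal integration; the coefficient functions b, a, c, h, g, the step count, and the helper names are assumptions for illustration and not the exact implementation used later.

```python
import numpy as np

def fk_sample(b, a, c, h, g, t, x, T, n_steps, rng):
    """One realization of the random variable (19) starting from (t, x)."""
    m = a(t, x).shape[1]
    dt = (T - t) / n_steps
    s, X = t, np.array(x, dtype=float)
    int_c = 0.0                                   # running \int_t^s c(r, X(r)) dr
    payoff = 0.0                                  # running \int_t^s h(...) e^{-int_c} ds
    f_prev, c_prev = h(s, X) * np.exp(-int_c), c(s, X)
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=m)
        X = X + b(s, X) * dt + a(s, X) @ dW       # Euler-Maruyama step for (20)
        s += dt
        c_new = c(s, X)
        int_c += 0.5 * (c_prev + c_new) * dt      # trapezoidal rule for the exponent
        f_new = h(s, X) * np.exp(-int_c)
        payoff += 0.5 * (f_prev + f_new) * dt     # trapezoidal rule for the h-integral
        c_prev, f_prev = c_new, f_new
    return payoff + g(X) * np.exp(-int_c)         # eq. (19)
```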

We describe how to use the data resulting from FK sampling as a dataset \({\overline{U}}\) and noise variance matrix \(r_{XX}\) that satisfy (14). For the data set \({\overline{U}}\) in (11), we use \({\check{U}}=[{\check{u}} (x_1),\ldots ,{\check{u}} (x_N)]^\top \), where \({\check{u}} (x_i)\) is the sample mean of M FK samples \(\{{\widehat{u}}_j(0,x_i)\}_{j=1}^M\):

$$\begin{aligned} {\check{u}}{(x_i)}{:=}\sum _{j=1}^M \frac{{{\widehat{u}}}_j (0,x_i)}{M}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(23)

where \({\widehat{u}}_j\) denotes the jth FK sample. It is also appropriate to use \(\check{r}_{XX}\) for the noise variance matrix \(r_{XX}\) in (14), which satisfies \([\check{r}_{XX}]_{ij}=\check{r}(x_i)\delta _{ij}\), where

$$\begin{aligned} \check{r}(x_i) {:=}\sum _{j=1}^M \frac{({\widehat{u}}_j(0,x_i) - \check{u}(x_i))^2}{M-1}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(24)

is the unbiased sample variance of M FK samples. In summary, our dataset satisfies the following:

$$\begin{aligned} \check{u}(x_i) = v(0,x_i) + \check{\xi }_i, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(25)

where \(\check{\xi }_i{:=}\sum _{j=1}^M {{\widehat{\xi }}}_j(0,x_i)/M\) and \({{\widehat{\xi }}}_j(0,x_i) {:=}{\widehat{u}}_j(0,x_i) - v(0,x_i)\).
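
For illustration, the data (23) and the noise variances (24) could be assembled from repeated FK samples as in the following sketch; fk_sample_fn is assumed to wrap the coefficient functions (for example, a partial application of the FK sampling sketch above), and the names are illustrative.

```python
import numpy as np

def build_dataset(fk_sample_fn, X_obs, M, rng):
    """Sample means (23) and unbiased sample variances (24) at each observation point."""
    u_check = np.empty(len(X_obs))
    r_check = np.empty(len(X_obs))
    for i, x_i in enumerate(X_obs):
        samples = np.array([fk_sample_fn(0.0, x_i, rng) for _ in range(M)])
        u_check[i] = samples.mean()               # eq. (23)
        r_check[i] = samples.var(ddof=1)          # eq. (24)
    return u_check, r_check
```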

Remark 3

The proposed method allows one to choose an arbitrary low-dimensional bounded domain on which to solve the PDE. This is accomplished by restricting the FK sampling points to the low-dimensional bounded domain and then applying GPR only to those points. For example, in a d-dimensional problem, suppose that we focus on a bounded domain of \({\tilde{d}}\) dimensions and set the number of observation points per dimension to \({\tilde{N}}\). In this case, the most computationally intensive operation of the proposed method is the computation of the posterior variance in GPR, with a computational complexity of \(O({\tilde{N}}^{3{\tilde{d}}})\). On the other hand, FEM requires computations across all dimensions. Thus, when using direct solvers for dense linear systems, the computational complexity is \(O({\tilde{N}}^{3d})\), where \(\widetilde{N}\) denotes the degrees of freedom in the mesh per dimension. Therefore, when \({\tilde{d}} < d\), the proposed method requires much lower computational cost than FEM. This computational advantage is demonstrated in Sect. 5.

Remark 4

Because the noise term \(\check{\xi }\) in (25) is not normally distributed, (12) and (25) do not correspond directly. However, if the variance of \({{\widehat{\xi }}}\) is bounded and the sample size M is large enough, the central limit theorem guarantees that the random variable \(\check{\xi }\) approximately follows a normal distribution with mean 0 and variance \(\check{r}/M\) [18]. Because of this fact, we use (25) as an approximate model for (12). In addition, several results assess the non-normality of variables for finite sample sizes M [18, 28]. For example, the error of the distributions for finite M is estimated by the Berry–Esseen theorem [18], which states that the rate of convergence is \(O(M^{-1/2})\).

4 Posterior Variance Analysis

In this section, we carry out an a priori evaluation of the posterior variance in (17). We focus on the integral value of the MSE on the input space \({\mathcal {X}}\), called integrated MSE (IMSE):

$$\begin{aligned} \textrm{IMSE}{:=}\int _{{\mathcal {X}}} {\widetilde{\sigma }}^2(x) \textrm{d}\nu (x), \end{aligned}$$
(26)

where \(\nu \) is a given probability measure of the observation points whose support is \({\mathcal {X}}\).

Here, we aim to characterize the speed of the decay of IMSE with respect to an increase in the number of observation points N and the sample size M. To achieve this, we first characterize the decay speed using the eigenvalues of the kernel function and the magnitude of the noise term. Specifically, we derive a probabilistic lower bound for the IMSE and show that the IMSE is expected to decrease as N and M increase (see Theorem 1). Among the variables used in this lower bound, the eigenvalues can be considered known, but the magnitude of the noise term is unknown. We therefore probabilistically assess the error between the true value of the noise term and its estimate, and show that the evaluation of the IMSE is justified under sufficiently large sample size M (see Theorem 2). After deriving these theorems, we give a more conservative estimate of the IMSE, which does not include information on the eigenvalues of the kernel and the magnitude of the noise term (see Corollary 1).

Assumption 2

  1.

    The kernel function k is bounded and positive-definite on \({\mathcal {X}}\).

  2.

    The observation points \(X=[x_1,\ldots ,x_N]^\top \) are sampled from the known probability measure \(\nu \).

  3.

    The data set \({\overline{U}}\) in (11) satisfies (14).

  4.

    For the noise variance function r in (10), there exists a positive minimum value \(r_\textrm{min}=\min _{x\in {\mathcal {X}}} r(x)>0\).

Theorem 1

Suppose that Assumption 2 holds. Then, for any \(\epsilon >0\), there exists an integer \(\widetilde{N}(\epsilon )>0\) such that for any \(N>\widetilde{N}\) the following holds:

$$\begin{aligned} \textrm{Pr} \left( \textrm{IMSE} > (1-\epsilon ) L_\textrm{IMSE}\right) = 1, \end{aligned}$$
(27)

where

$$\begin{aligned} L_\textrm{IMSE}{:=}r_\textrm{min} \sum _{p \in I} \frac{\lambda _p}{r_\textrm{min} + NM \lambda _p}. \end{aligned}$$
(28)

Here, \(\{\lambda _p\}_{p\in I}\) denotes the eigenvalues of the kernel function k for the probability measure \(\nu \). The set \(I\subseteq {\mathbb {N}}\) denotes the index set of the eigenvalues of the kernel.

Remark 5

The provided lower bound helps estimate the minimum sample size necessary to achieve a predetermined level of IMSE. This characteristic contrasts with the upper bound, which is commonly provided as a sufficient condition to guarantee the required performance. In particular, the fastest order O(1/MN) provided in Equation (30) implies that to achieve a tenfold improvement in the current IMSE, it is necessary to increase the sample size (M or N) by a factor of ten. Understanding the lower bound of the IMSE thus aids in designing numerical computations.

Remark 6

The evaluation of lower bounds on the posterior variance has long been studied as an asymptotic analysis of Gaussian processes [41, 50]. To the best of our knowledge, however, existing results only address the case where the noise model in the GPR is homoscedastic. We extend these results to the heteroscedastic case.

Remark 7

In the right-hand side of (27), the actual decay speed depends on the decay rate of the eigenvalues of the kernel. For some smooth kernels, the decay rate of the eigenvalues has been evaluated under known measures of the observation points [39]. It is also possible to evaluate the eigenvalues numerically by performing a sampling routine (see Chapter 4 of [50]).
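
As a hedged illustration of this remark and of the bound (28), the eigenvalues can be approximated by sampling from the measure \(\nu \) and diagonalizing the resulting kernel matrix (cf. Chapter 4 of [50]), after which \(L_\textrm{IMSE}\) is evaluated directly; the sampler sample_from_nu and the other names below are assumptions.

```python
import numpy as np

def approx_eigenvalues(k, sample_from_nu, n_points, rng):
    """Approximate the eigenvalues of k w.r.t. nu by the eigenvalues of a Gram matrix / n_points."""
    Xs = sample_from_nu(n_points, rng)                          # points drawn from nu
    K = np.array([[k(xi, xj) for xj in Xs] for xi in Xs])
    lam = np.linalg.eigvalsh(K) / n_points
    return np.clip(lam, 0.0, None)

def imse_lower_bound(lam, r_min, N, M):
    """Lower bound L_IMSE of eq. (28)."""
    return r_min * np.sum(lam / (r_min + N * M * lam))
```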

To evaluate the right-hand side of Theorem 1, it is necessary to find the minimum \(r_{\min }\) of the noise variance function r of the GPR. However, as mentioned in Sect. 3.2, r can only be estimated pointwise using sample variance as in (24). Therefore, it is natural to use \({\overline{r}}_{\min } {:=}\overline{r}(x_{i^*})\ (i^* {:=}\mathop {\textrm{argmin}}\limits _{i\in \{1,\ldots ,N\}}\overline{r}(x_i))\) as an estimate of \(r_{\min }\), where \({\overline{r}}(x_i) = \sum _{j=1}^M (u_j(x_i)-{\overline{u}}(x_i))^2/(M-1)\) is the unbiased variance of (9). The following theorem probabilistically evaluates the error between \(r_{\min }\) and \({\overline{r}}_{\min }\).

Theorem 2

Suppose that Assumptions 2.2–2.4 hold. Then, for any N, M, \(\delta >0\), and \(\epsilon >\delta \),

$$\begin{aligned} \begin{aligned}&\textrm{Pr}(|{\overline{r}}_{\min } - r_{\min }|\ge \epsilon , \delta >|r(x_{i^*}) - r_{\min }|) \\&\le 1-\prod _{i=1}^N \max \left\{ 1-\frac{2 r^2\left( x_i\right) }{(\epsilon -\delta )^2(M-1)}, 0\right\} . \end{aligned} \end{aligned}$$
(29)

Remark 8

From (29), we can justify replacing \(r_{\min }\) in (28) with the minimum sample variance \({\overline{r}}_{\min }\) under a sufficiently large M relative to N.

Apart from the results above, it is also possible to evaluate the lower bound of the posterior variance in a form that does not include eigenvalues or noise variance. The following lower bound is more conservative than the one given in Theorem 1 but is more useful for an intuitive understanding.

Corollary 1

Suppose that Assumption 2 holds. Then, there exists a constant \(C>0\) such that for any \(N\in {\mathbb {N}}\) and \(M\in {\mathbb {N}}\),

$$\begin{aligned} L_\textrm{IMSE} \ge \frac{C}{NM}. \end{aligned}$$
(30)

Remark 9

From (30), we see that the fastest convergence rate of the posterior variance is \(O((NM)^{-1})\) for an FK sample size M and the number of GPR observation points N.

5 Numerical Calculation

In this section, we apply the proposed method to several PDEs and evaluate its performance. The target PDEs are the heat equation, the advection–diffusion equation, and the Hamilton–Jacobi–Bellman (HJB) equation. The goal is to learn the functional form of the solution on \(x\in [0,1]^d\) for \(d=10\) spatial dimensions. We learn only the first \(\widetilde{d}\) dimensions of the PDE solution, where \(\widetilde{d}\in \{1,2,3\}\). Specifically, we use N equally spaced grid points in the range [0, 1] for each of the first \(\widetilde{d}\) dimensions and fix the remaining \((d-\widetilde{d})\) coordinates to 0.5. In other words, we set \({\mathcal {X}}=[0, 1]\times \cdots \times [0, 1]\times \{0.5\}\times \cdots \times \{0.5\}\) and use the \(\widetilde{d}\)-dimensional Lebesgue measure as the probability measure \(\nu \). This restriction is made because using all \(N^d\) grid points as observation points would exhaust the machine's memory when allocating variables in GPR. The approximation methods proposed in [13] would alleviate this problem, but their implementation is beyond the scope of this paper.
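
For concreteness, the observation grid just described could be constructed as in the following sketch; the names are illustrative.

```python
import itertools
import numpy as np

def observation_grid(d=10, d_tilde=2, N=12):
    """N equally spaced points per estimated dimension on [0, 1]; remaining coordinates fixed at 0.5."""
    axis = np.linspace(0.0, 1.0, N)
    pts = []
    for coords in itertools.product(axis, repeat=d_tilde):
        x = np.full(d, 0.5)
        x[:d_tilde] = coords
        pts.append(x)
    return np.array(pts)                      # shape (N**d_tilde, d)
```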

We compare the performance of the following three estimation methods:

  • Heteroscedastic GPR (HGPR): The proposed method, in which the noise variance function in GPR is estimated via the sample variance of the FK samples as in (24).

  • Standard GPR: A method that treats \(r\equiv \sigma ^2\ (\sigma \in {\mathbb {R}})\) as a hyperparameter in GPR, where the hyperparameters are determined to maximize the marginal likelihood.

  • Linear interpolation: A method to linearly interpolate the function value \(\{{\overline{u}}(x_i)\}_{i=1}^N\) obtained by FK sampling.

We evaluate each method using the following criteria:

  1.

    Squared \(L^2\)-norm error of the estimates for the test data:

    $$\begin{aligned} \textrm{Error} = \int _{\mathcal {X}} \left[ \widetilde{u}(x) - v(x)\right] ^2 \textrm{d}x, \end{aligned}$$
    (31)

    where \(\widetilde{u}(x)\) is the estimate of the solution of the PDE and v(x) denotes the test data approximating the true solution v(0, x) of the PDE (1) and (2). The numerical integration in (31) is carried out with the trapezoidal rule.

  2.

    Integrated mean squared error (IMSE) defined in (26), along with its bound in (28). Here, to evaluate the bound in (28), the eigenvalues are approximated using a sampling routine, as described in Chapter 4 of [50]. The term \(r_{\min }\) is approximated by \(\check{r}_{\min }=\min _{i} \check{r}(x_i)\), where \(\{\check{r}(x_i)\}_{i=1}^N\) are the sample variances of the FK samples (19).

The first criterion evaluates the accuracy of the estimate, while the second criterion evaluates the uncertainty. We explain the method of creating test data in each subsection. Note that only GPR-based methods can calculate the IMSE.

In GPR, the hyperparameters \(\alpha \) and h of the Matérn kernel in (18) need to be determined. We determine \(\alpha \) to correspond to a Sobolev space of order 2, taking into account that we seek to solve second-order PDEs. The hyperparameter h is determined to maximize the marginal log-likelihood of the data:

$$\begin{aligned} \begin{aligned} h^*&= \mathop {\textrm{argmax}}\limits _{h} \log p({\overline{U}}|X,h) \\&= \mathop {\textrm{argmax}}\limits _{h} \log {\mathcal {N}}({\overline{U}}|0, k_{X X}(h)), \end{aligned} \end{aligned}$$
(32)

where \(k_{X X}(h)\) is the Matérn kernel matrix with the hyperparameter h. This maximization is performed by gradient ascent on the log-likelihood with respect to the hyperparameter. Since the log-likelihood is a non-convex function, the optimization converges to a local solution depending on the choice of initial values (see Chapter 5 in [50]). To search for a better local solution, 27 initial values of the hyperparameter are randomly chosen, and the one with the largest likelihood is adopted. Functions built into gpytorch are used for optimizing hyperparameters and for computing posterior distributions and kernel functions [12].
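
As a rough sketch of this setup, the HGPR model can be expressed in gpytorch with a Matérn kernel and a fixed pointwise noise variance \(\check{r}(x_i)/M\); the restart loop below is a simplified stand-in for the actual procedure, and the tensor names (train_x, u_bar, r_check) are assumptions. Here nu=1.5 corresponds to a Sobolev space of order 2 for \({\tilde{d}}=1\) via Proposition 3.

```python
import torch
import gpytorch

class HGPRModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.MaternKernel(nu=1.5)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

def fit_hgpr(train_x, u_bar, r_check, M, n_restarts=27, n_iters=100):
    best_model, best_loss = None, float("inf")
    for _ in range(n_restarts):
        likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
            noise=r_check / M, learn_additional_noise=False)   # heteroscedastic noise r_XX / M
        model = HGPRModel(train_x, u_bar, likelihood)
        model.covar_module.lengthscale = torch.rand(1) + 0.1   # random initial value of h
        model.train(); likelihood.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        for _ in range(n_iters):
            optimizer.zero_grad()
            loss = -mll(model(train_x), u_bar)                 # negative marginal log-likelihood, cf. (32)
            loss.backward()
            optimizer.step()
        if loss.item() < best_loss:
            best_loss, best_model = loss.item(), model
    return best_model
```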

Since the matrix \(k_{XX}\) is positive semidefinite and \(r_{XX}\) is a positive diagonal matrix, as assumed in Assumption 2, the inverse matrix in (16) exists. However, when the noise variance function r is small or the sample size M is extremely large, the condition number of the matrix may deteriorate, potentially leading to instability in computing the inverse. In such cases, the computation can be stabilized by adding a homoscedastic noise term. In our numerical computations, however, the calculations remained stable without any additional noise term.

In FK sampling, the Euler–Maruyama method is used to numerically solve the SDE (20) at each observation point, and the trapezoidal rule is used for the numerical integration in (19). The results of FK sampling depend on the sample paths of the Brownian motion. To suppress the effect of this randomness, in each experiment we generated FK samples using multiple random seeds and performed GPR with each of them. Specifically, for \({\tilde{d}} = 1\), we compared the average error and IMSE over 50 different random seeds; for \({\tilde{d}} = 2\), over 10 different random seeds; and for \({\tilde{d}} = 3\), over 5 different random seeds.

Before discussing the results for each PDE, we give a summary of the results that are common to all experiments. First, regarding the squared \(L^2\)-norm error, the error tends to become smaller as M and N are increased for all three methods. The errors for HGPR and GPR are comparable, and these two methods achieve smaller errors than linear interpolation. With respect to the IMSE, both HGPR and GPR achieve smaller IMSE for larger N and M. Comparing the two, HGPR achieves a smaller IMSE overall, suggesting that the proposed method achieves estimation with lower uncertainty.

5.1 Heat Equation

We consider the following initial value problem for the heat equation:

$$\begin{aligned}&\begin{aligned} \partial _t w(t,x)&=\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} w(t, x)\right\} , \end{aligned} \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(33)
$$\begin{aligned}&w(0,x)=\exp \left( -\lambda (x-0.5)^{\top } (x-0.5)\right) , \qquad \qquad x \in \mathbb {R}^d. \end{aligned}$$
(34)

We set \(a \equiv \sigma I_d\) with \(\sigma =0.4\), \(\lambda = 5.0\), and \(T=1.0\). To solve the initial value problem, the proposed method is applied after using the variable transformation described in Remark 1.

The solution of this Cauchy problem is given as

$$\begin{aligned} \begin{aligned} w(t,x)&= \frac{1}{(4\lambda \alpha )^{d/2}} \exp \left( -\frac{(x-0.5)^{\top } (x-0.5)}{4\alpha }\right) , \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(35)

where \(\alpha = {1}/{(4\lambda )} + \sigma ^2 t/2\). We evaluate the accuracy of the estimation results using this analytical solution as test data.

Fig. 1

Estimation results of the heat equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. In the GPR-based methods, the posterior variance is displayed as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\), that is, we estimate the solution along the first of the ten dimensions. Figure 1 shows the estimation results for the first dimension of the solution to the 10-dimensional PDE. Here, we used an FK sample size of \(M = 800\) and \(N = 12\) observation points. In this figure, HGPR provides estimation results that are closest to the analytical solution and also exhibits the smallest uncertainty in the estimates.

Fig. 2

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=20\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

We examine the behavior of the squared \(L^2\)-norm error and the IMSE with respect to the sample size M and the number of observation points N. Here, we set \(N=20\) when varying M and \(M=800\) when varying N. The results are shown in Fig. 2. We see that increasing M or N tends to decrease the \(L^2\)-norm error and the IMSE for each method. Regarding the \(L^2\)-norm error, all methods decrease the error as the sample size M is increased. This shows that the proposed method approximates the solution of the PDE more accurately as these parameters are increased. The errors for HGPR and GPR are comparable, while the errors for linear interpolation are larger. As the number of observation points N is increased, linear interpolation retains a larger error, while HGPR and GPR achieve smaller errors. In the region where N is small, HGPR reduces the error more than GPR. For both HGPR and GPR, the IMSE decreases monotonically with increasing M. For all M, HGPR achieves a smaller IMSE than GPR. When N is increased, the IMSE decreases in both HGPR and GPR, and the value for HGPR is smaller than that for GPR. The approximated IMSE bound is found to decay on a similar order of magnitude to the measured values.

Fig. 3

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results of estimating the first and second components out of the ten dimensions, i.e., \(\widetilde{d}=2\). We set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 3. Note that in the figure, the horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^2\). As N and M increase, the error and the IMSE of each method generally decrease. For the GPR-based methods, however, the IMSE levels off as N increases. In many parameter regions, HGPR outperforms the other methods in terms of both error and IMSE.

Fig. 4

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results of estimating the first to third components out of the ten dimensions, that is, \(\widetilde{d}=3\). We set \(N = 12^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 4. The horizontal axis N represents the number of observation points per dimension, resulting in a total of \(N^3\) observation points. The behavior of the errors and the IMSE with respect to M shares the same characteristics as in Fig. 3. As N is increased, the error improves for each of the three methods, and the IMSE decreases for both the numerical solutions and the lower bound.

Summarizing the results for both criteria, HGPR is better than linear interpolation but similar to GPR in terms of error, and better than GPR in terms of uncertainty.

5.2 Advection–Diffusion Equation

Next, we consider the following initial value problem for the advection–diffusion equation:

$$\begin{aligned}&\begin{aligned} \partial _t w(t,x)&=\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} w(t, x)\right\} - b^\top \partial _x w(t, x), \end{aligned} \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(36)
$$\begin{aligned}&w(0,x)=\exp \left( -\lambda (x-0.5)^{\top } (x-0.5)\right) , \qquad \qquad \quad x \in \mathbb {R}^d, \end{aligned}$$
(37)

where we have defined \(a \equiv \sigma I_d\) with \(\sigma =0.4\), \(b=0.01\), \(\lambda =5.0\), and \(T=1.0\). As in the previous subsection, the proposed method is applied after using the variable transformation described in Remark 1.

The analytic solution of this Cauchy problem is given as

$$\begin{aligned} \begin{aligned} w(t,x)&= \frac{1}{(4\lambda \alpha )^{d/2}} \exp \left( -\frac{(x-0.5 + b t)^{\top } (x-0.5 + b t)}{4\alpha }\right) , \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(38)

where \(\alpha = {1}/{(4\lambda )} + \sigma ^2 t/2\). This analytical solution is used as test data to evaluate the accuracy of the estimation results.

Fig. 5

Estimation results of the advection–diffusion equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. For the GPR-based methods, the posterior variance is illustrated as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\). Figure 5 presents the estimation results for the first component of the solution to the 10-dimensional PDE. We use an FK sample size of \(M = 200\) and \(N = 12\) observation points. The estimation results of HGPR and GPR show negligible differences in terms of both posterior mean and posterior variance. Moreover, the results from both methods exhibit smaller errors than those obtained by linear interpolation.

Fig. 6

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, we set the number of observation points N as \(N=20\), and when varying the number of observation points N, we set the sample size M as \(M=800\) (Color figure online)

The squared \(L^2\)-norm error and IMSE when varying M and N are plotted in Fig. 6. Similar to the example above, increasing M or N in each method tends to decrease the \(L^2\)-norm error and the IMSE. All methods decrease the \(L^2\)-norm error as the sample size M is increased. In many parameter regions, the errors for HGPR and GPR are smaller than those for linear interpolation. For both HGPR and GPR, the IMSE decreases monotonically with increasing M. For all M, HGPR achieves a smaller IMSE than GPR. When N is increased, the IMSE of both HGPR and GPR decreases. In many regions, HGPR achieves a lower IMSE than GPR. The approximately calculated IMSE bound decays on a similar order of magnitude to the measured values.

Fig. 7

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results for the case where \(\widetilde{d}=2\). Here, we set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 7. The horizontal axis N represents the number of observation points per dimension, resulting in an actual number of observation points of \(N^2\). Similar to the case of the two-dimensional heat equation, HGPR exhibits better performance than the other two methods in many parameter regions.

Fig. 8

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results for the case where \(\widetilde{d}=3\). Here, we set \(N = 12^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 8. The horizontal axis N represents the number of observation points per dimension, resulting in an actual number of observation points of \(N^3\). These results basically share common characteristics with the estimation results in Fig. 4.

5.3 Hamilton–Jacobi–Bellman Equation

Finally, we consider the following terminal value problem for the Hamilton–Jacobi–Bellman (HJB) equation:

$$\begin{aligned} -\partial _t v(t, x)&= \ell (x) -\frac{1}{2} \partial _x v(t, x)^{\top } B R^{-1} B^{\top } \partial _x v(t, x) + \frac{1}{2}\operatorname {tr}\left\{ aa^\top \partial _{x x} v(t, x)\right\} ,\nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(39)
$$\begin{aligned} v(T, x)&=g(x),\qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(40)

where we have defined \(\ell (x) = (x-0.5\mathbbm {1}_d)^\top Q (x-0.5\mathbbm {1}_d)/2\) with \(Q=I_d\), \(g(x)\equiv 0\), \(B=I_d\), \(R=I_d\), \(a \equiv \sigma I_d\) with \(\sigma =0.4\), and \(T=1.0\). This PDE corresponds to the following optimal control problem:

$$\begin{aligned}&\text {minimize } J(\alpha ) = \mathbb {E}\left\{ \int _{0}^{T} \ell (X(s)) + \frac{1}{2}\alpha (s)^\top R \alpha (s) \textrm{d} s + g (X(T))\right\} , \end{aligned}$$
(41)
$$\begin{aligned}&\text {subject to } \textrm{d}X (s)=B\alpha (s)\textrm{d}s + a \textrm{d}W(s), \ \ s \in [0, T], \end{aligned}$$
(42)
$$\begin{aligned}&\qquad \qquad \quad X(0) = x. \end{aligned}$$
(43)

More specifically, the solution \(\alpha ^*\) of this optimal control problem is represented as follows:

$$\begin{aligned} \alpha ^*(t,x) = -R^{-1} B^\top \partial _x v(t,x). \end{aligned}$$
(44)

Since the HJB equation contains nonlinear terms, the proposed method is not directly applicable. Thus, we use the Cole–Hopf transformation to linearize the equation [20]. First, we find a real number \(\lambda \) that satisfies \(\lambda BR^{-1}B^\top = a a^\top \). Using this, we transform the PDE with \(\widetilde{v}=\exp (-v/\lambda )\):

$$\begin{aligned}&\begin{aligned} \partial _t \widetilde{v}(t,x)&=-\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} \widetilde{v}(t, x)\right\} + \frac{1}{\lambda } \ell (x) \widetilde{v}(t,x), \end{aligned} (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(45)
$$\begin{aligned}&\widetilde{v}(T,x)=\exp \left( -\frac{g(x)}{\lambda }\right) , \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(46)

To apply the proposed method, we first conduct FK sampling on the linearized PDE. Next, we apply the inverse Cole–Hopf transformation to the obtained pointwise data. Finally, we carry out regression with each method on the inverse-transformed data. Such transformations may violate the assumptions on the distribution of the data and the noise stated in Assumption 2.3. However, our numerical experiments below demonstrate that the method nevertheless performs well in practice.
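
In the present setting with \(B=R=I_d\) and \(a=\sigma I_d\), the constant reduces to \(\lambda =\sigma ^2\), and the inverse transformation applied to the FK samples is simply \(v=-\lambda \log \widetilde{v}\); a minimal sketch, with illustrative names, follows.

```python
import numpy as np

sigma = 0.4
lam = sigma ** 2                       # lambda B R^{-1} B^T = a a^T with B = R = I_d, a = sigma I_d

def inverse_cole_hopf(v_tilde_samples):
    """Pointwise inverse Cole-Hopf transform applied to FK samples before regression."""
    return -lam * np.log(np.asarray(v_tilde_samples))
```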

The terminal value problem (39) and (40) does not have a closed-form analytical solution. However, the solution can be characterized through the solution of a Riccati ordinary differential equation (ODE). The terminal value problem for the Riccati equation is defined as follows:

$$\begin{aligned} \dot{P}(t)&= P(t)BR^{-1}B^\top P(t) - Q, \end{aligned}$$
(47)
$$\begin{aligned} P(T)&= 0. \end{aligned}$$
(48)

Using the solution P(t), the solution of the HJB equation is expressed as follows [10]:

$$\begin{aligned} v(t,x) = \frac{1}{2}(x-0.5\mathbbm {1}_d)^\top P(t) (x-0.5\mathbbm {1}_d) + \frac{1}{2}\int _{t}^T \operatorname {tr} \left\{ a a^\top P(s)\right\} \textrm{d}s. \end{aligned}$$
(49)

Therefore, by numerically solving this ODE, it is possible to calculate the solution to the PDE. We compare the errors between the numerical solutions obtained by the proposed method and the numerical solution of the ODE. For the discretization of the ODE, the Euler method with a time step of \(\varDelta t = 0.01\) is used.
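
For reference, a minimal sketch of this backward-in-time Euler integration of the Riccati problem (47) and (48) is given below; the names are illustrative.

```python
import numpy as np

def solve_riccati(d, T=1.0, dt=0.01):
    """Integrate (47) backward from the terminal condition P(T) = 0 with the explicit Euler method."""
    Q, B, R_inv = np.eye(d), np.eye(d), np.eye(d)   # Q = B = R = I_d in this experiment
    n_steps = int(round(T / dt))
    P = np.zeros((d, d))                            # P(T) = 0, eq. (48)
    traj = [P.copy()]                               # stored from t = T down to t = 0
    for _ in range(n_steps):
        P_dot = P @ B @ R_inv @ B.T @ P - Q         # eq. (47)
        P = P - dt * P_dot                          # one Euler step backward in time
        traj.append(P.copy())
    return traj[::-1]                               # [P(0), P(dt), ..., P(T)]
```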

Fig. 9

Estimation results of the HJB equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. In the GPR-based methods, the posterior variance is illustrated as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\). Figure 9 shows the estimation results for the first component of the solution to the 10-dimensional PDE. Figure 9a presents the estimation results for the solution, while Fig. 9b illustrates the estimation results for the spatial derivatives of the solution. We use an FK sample size of \(M = 100\) and \(N = 12\) observation points. We compute the spatial derivatives of the solution since they are needed to calculate the optimal control input in (44). In the GPR-based methods, the derivatives of the posterior mean are obtained analytically by differentiating the kernel function [48]. We perform these calculations using the automatic differentiation feature of pytorch [33]. For linear interpolation, the derivatives are approximated by finite differences of the FK sample means at adjacent observation points. Regarding the estimation results, the GPR-based methods provide estimates that are close to the test data and exhibit small uncertainty. Furthermore, the GPR-based methods approximate the derivative values accurately.
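
As a hedged sketch of this differentiation step, the gradient of the GP posterior mean with respect to the input can be obtained with torch autograd, assuming a fitted gpytorch model (for example, from the earlier fitting sketch) is available; the function name is illustrative.

```python
import torch

def posterior_mean_gradient(model, x):
    """Gradient of the posterior mean at x, used for the optimal control input (44)."""
    model.eval()
    x = x.clone().requires_grad_(True)            # shape (1, d)
    mean = model(x).mean.sum()                    # posterior mean (16) evaluated at x
    grad, = torch.autograd.grad(mean, x)          # d(posterior mean)/dx
    return grad
```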

Fig. 10

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, we set the number of observation points N as \(N=20\), and when varying the number of observation points N, we set the sample size M as \(M=800\) (Color figure online)

The squared \(L^2\)-norm error and IMSE when varying M and N are plotted in Fig. 10. Similar to the two examples above, increasing M or N in each method tends to decrease the \(L^2\)-norm error and the IMSE. All methods decrease the \(L^2\)-norm error as the sample size M is increased. The values for HGPR and GPR are comparable, while the values for linear interpolation are larger. When the number of observation points N is increased, linear interpolation retains a larger error, while HGPR and GPR achieve smaller errors. HGPR reduces the error more than GPR, particularly in the region where N is small. Next, for both HGPR and GPR, the IMSE decreases monotonically with increasing M, and for all M, HGPR achieves a smaller IMSE than GPR. As N is increased, the IMSE of HGPR decreases more than that of GPR. The IMSE bound is tighter here than in the two examples above. In the region where N is small, the IMSE of GPR is close to the estimated lower bound. Note that the lower bound given in Theorem 1 is a probabilistic inequality that holds when N is sufficiently large and does not necessarily hold for small N.

Fig. 11

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results for the case where \(\widetilde{d}=2\). Here, we set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 11. The horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^2\). As N and M increase, the error decreases for each method. The IMSE becomes smaller as M is increased, but not as N is increased. The GPR-based methods achieve smaller errors than linear interpolation; however, HGPR and GPR exhibit similar performance aside from a slight difference in IMSE.

Fig. 12

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=8^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results for the case where \(\widetilde{d}=3\). Here, we set \(N = 8^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 12. The horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^3\). The behavior of the errors and the IMSE with respect to M shares the same characteristics as in Fig. 11, while their behavior with respect to N has features in common with Fig. 4.

6 Conclusion

We have proposed a numerical method for Kolmogorov PDEs that combines heteroscedastic Gaussian process regression with the Feynman–Kac formula. The proposed method can assess the uncertainty in the numerical results and can compute the solution on an arbitrary low-dimensional bounded domain. The quality of the solution is improved by designing the kernel function and by incorporating the noise information obtained from Monte Carlo samples into the GPR noise model. We have also derived a lower bound on the posterior variance, which is useful for estimating the best-case performance of the numerical calculations. The numerical results confirm that the proposed method provides better approximations over wide parameter ranges than naive linear interpolation and standard GPR.

In future work, we will explore iterative observation schemes that leverage uncertainty information. Active learning of observation points for efficient function approximation and extremum search has been widely investigated [15, 37, 40]. In line with these previous studies, we expect to estimate the solution of the PDE efficiently by adaptively selecting sampling points. We will also investigate the design of uncertainty-aware acquisition functions to guide the search for regions of interest, such as those that maximize or minimize a specific physical quantity.