1 Introduction

Kolmogorov partial differential equations (PDEs) have found wide application in physics [32], chemistry [46], finance [3], and the natural sciences [7]. Various numerical methods, such as finite difference [4, 21, 22, 52] and finite element methods [11, 53], have been proposed and are widely adopted by practitioners. While these numerical methods are undoubtedly of high value in solving various engineering problems, two challenges must be overcome to address the more complex problems that arise in real-world scenarios:

  1.

    Evaluating the numerical validity. When performing numerical computations, it is desirable to estimate the error between the numerical results and the true solution. Although existing numerical methods can address this through error analysis techniques, they typically require careful mathematical analysis, which is different for each PDE of interest. Ideally, one would obtain numerical information about the validity of the solution simultaneously with the computed solution.

  2.

    Constraining the solution domain. In mesh-based methods, one typically partitions the domain on which the PDE is defined into a mesh, and then solves a globally defined linear equation to obtain the solution. Since the computational complexity dramatically increases with the degrees of freedom in meshes, it becomes difficult to solve problems where the degrees of freedom inevitably become large, such as those defined on unbounded domains or those in high-dimensional settings. Under such circumstances, it is desirable to restrict the domain of interest to a low-dimensional bounded domain so that the solution can be obtained efficiently.

The first challenge has long been the focus of error analysis in conventional numerical methods, such as finite difference and finite element methods [43, 47]. In particular, in the finite element method, a priori or a posteriori error estimation plays a crucial role, which is performed before and after the numerical solution is obtained, respectively [5, 23]. A priori estimation primarily involves truncation error analysis [6], while a posteriori estimation often involves flux recovery methods, residual methods, and duality-based approaches [44]. In general, tight error analysis requires careful mathematical evaluation, which depends on the form of the PDEs. It is therefore desirable to evaluate the error numerically, regardless of the form of the PDEs. Furthermore, the reliance on meshes in these conventional methods poses computational challenges when the degrees of freedom in meshes become large, such as in problems defined on unbounded domains or in problems with high-dimensional spaces.

For the second challenge, Monte Carlo-based mesh-free calculations are actively investigated. In particular, methods based on the Feynman–Kac (FK) formula have attracted attention as a promising strategy for constraining the solution domain [8, 9, 16, 17, 26, 34, 36]. This approach selects a point in space-time at which the solution is sought, simulates multiple realizations of a stochastic differential equation (SDE) starting from that point, and approximates the PDE solution at the selected point by the ensemble average of the SDE solutions. Since the FK formula provides the solution only at a single point in space-time, regression over multiple sampled points is required to obtain the solution over the domain of interest. Regression via linear interpolation and neural networks [2, 9, 38, 45] has been proposed. Nonetheless, these approaches give only point estimates of the PDE solutions and cannot guarantee the accuracy of the estimated results.

This study proposes a new numerical method to solve the above two challenges simultaneously. In the proposed method, Gaussian Process Regression (GPR) is used to obtain solutions to PDEs over a domain of interest while evaluating their uncertainty. Specifically, GPR smoothly interpolates solutions at discrete points obtained by a Monte Carlo method based on the FK formula.

Here, GPR is compatible with Monte Carlo methods from two perspectives that are unattainable with other regression techniques. Firstly, GPR can incorporate a priori knowledge about the regularity of PDE solutions. In particular, designing a kernel function allows for determining the differentiability of the regression results. Secondly, using the variance information of noise contained in the Monte Carlo samples increases the accuracy of the regression. The variance is easily estimated by the sample variance of the Monte Carlo samples.

To address the first challenge, the proposed method can assess the uncertainty of the regression through the posterior variance in the GPR. Under appropriate assumptions, the posterior variance allows a posteriori probabilistic evaluation of the error between the numerical solution and the true solution. In addition, the use of GPR also provides a priori estimation of the posterior variance. We establish a theoretical lower bound on the posterior variance, which is crucial for estimating the best-case performance of numerical solutions. In standard numerical error analysis, the estimated error is given as an upper bound. In contrast, our lower bound is an optimistic estimate of the error, yet it helps to determine the minimum sample size required to obtain a desired posterior variance.

To address the second challenge, the proposed method can choose an arbitrary low-dimensional bounded domain on which to compute the solution. This is achieved by restricting the Monte Carlo sampling points to the low-dimensional bounded domain and performing the GPR only on these points.

To demonstrate its effectiveness, we perform numerical computations on 10-dimensional PDEs. Instead of covering the entire 10-dimensional domain, we apply GPR to a bounded domain of one to three dimensions. Although the computational complexity of GPR scales cubically with the number of observation points, making full-dimensional solutions impractical, this strategy allows efficient computation on low-dimensional bounded domains. Numerical experiments show that the proposed method is more accurate and more robust than an existing technique.

The remainder of the paper is structured as follows: In Sect. 2, we define the class of the PDEs to be addressed. In Sect. 3, we propose a numerical method that combines the GPR and the FK formula. Theoretical results for the estimation of the posterior variance in the proposed method are presented in Sect. 4. In Sect. 5, we numerically apply the proposed method to three types of 10-dimensional PDEs and show that more accurate and less uncertain solutions are obtained than with existing methods. Section 6 provides a brief summary and suggests possible future work.

2 Problem Formulation

Throughout this paper, we consider the following Cauchy problem for the Kolmogorov PDEs:

$$\begin{aligned}&\begin{aligned} \partial _t v(t,x)&+\frac{1}{2} \operatorname {tr}\left\{ a(t, x) a(t, x)^{\top } \partial _{xx} v(t,x)\right\} + b(t, x)^\top \partial _x v(t, x) \\&\quad +c(t, x) v(t,x) + h(t, x)=0,\quad \quad \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned}&v(T,x)=g(x), \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(2)

where \(T>0\) denotes the terminal time, \(t\in [0,T]\) denotes the time, and \(v:[0,T]\times {\mathbb {R}}^d\rightarrow {\mathbb {R}}\) denotes the unknown function of the PDE. The function \(b:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^d\) denotes the advection coefficient, \(a:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^{d \times m}\) denotes the diffusion coefficient, \(c:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) denotes the reaction coefficient, \(h: [0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) denotes the generative coefficient, and \(g:\mathbb {R}^d \rightarrow \mathbb {R}\) denotes the terminal condition. We assume that each function satisfies the following conditions.

Assumption 1

The functions \(b:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^d\), \(a:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}^{d \times m}\), \(c:[0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\), and \(h: [0, T] \times \mathbb {R}^d \rightarrow \mathbb {R}\) are uniformly continuous. The function c is bounded. There exists a constant \(L>0\) such that the following holds for every \(\varphi \in \{ b, a, c, h\}\):

$$\begin{aligned} \begin{aligned} |\varphi (t, x)-\varphi (t, x^{\prime })|&\le L|x-x^{\prime }|, \quad \forall t \in [0, T], x, x^{\prime } \in \mathbb {R}^d, \\ |\varphi (t, 0)|&\le L, \qquad \qquad \forall t \in [0, T]. \end{aligned} \end{aligned}$$
(3)

Proposition 1

(Ref. [51]) Suppose that Assumption 1 holds. Then, (1) and (2) have a unique viscosity solution.

Finding a solution to (1) and (2) in closed form is possible only in very limited cases, and thus numerical calculations are usually required. The goal of this study is to find a numerical solution v of (1) and (2). More specifically, given the known functions a, b, c, h, and g, we aim to find the solution \(v(0,x)\ (x\in {\mathcal {X}})\) at the initial time \(t=0\), where the set \({\mathcal {X}}\subseteq {\mathbb {R}}^d\) is the domain of interest on which we want to know the solution. In addition, we also aim to assess the uncertainty of the solution, that is, how much the numerical results deviate from the true solution.

Remark 1

The terminal value problem (1) and (2) is transformed into the initial value problem by introducing the variable transformation \(w(t,x){:=}v(T-t,x)\):

$$\begin{aligned}&\begin{aligned} -\partial _s w(s,x)&+\frac{1}{2} \operatorname {tr}\left\{ a(s, x) a(s, x)^{\top } \partial _{xx} w(s, x)\right\} + b(s, x)^\top \partial _x w(s,x) \\&\quad +c(s, x) w(s, x) + h(s, x)=0, \quad \quad \ (s, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned}&w(0,x)=g(x), \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(5)

Thus, for initial value problems, this variable transformation should be performed before applying the proposed method described below.

3 Proposed Algorithm

In this section, we propose a method to find the solution \(v(0,x)\ (x\in {\mathcal {X}})\) of the PDE (1) and (2) at the initial time \(t=0\). We use GPR to quantify how plausible it is that each candidate function is the solution v of the PDE. Using prior knowledge and a data set, GPR expresses this plausibility as a distribution over functions. The prior knowledge about the solution v is represented as a probability distribution p(v). Once a data set U associated with v is observed, the posterior probability p(v|U) is obtained analytically by Bayesian estimation. In this sense, the proposed data-driven method estimates the posterior p(v|U) over candidate solutions instead of solving the PDE directly. In Sect. 3.1, we describe how to obtain the posterior distribution using GPR. In Sect. 3.2, we describe how to obtain the data set that approximates the solution of the PDE using the Feynman–Kac formula.

3.1 Gaussian Process Regression

We design the prior as a Gaussian process. For any \(X=[x_1,\ldots ,x_N]^\top \in {\mathcal {X}}^N\), we assume that the probability distribution \(p(v_X|X)\), where

$$\begin{aligned} v_{X}{:=}[v(0,x_1),\ldots ,v(0,x_N)]^\top \in {\mathbb {R}}^N, \end{aligned}$$
(6)

is a normal distribution:

$$\begin{aligned} p(v_X|X) = {\mathcal {N}}(0, k_{XX}). \end{aligned}$$
(7)

Here, the matrix \(k_{XX}\in {\mathbb {R}}^{N\times N}\) is defined as \(\left[ k_{XX}\right] _{i j}=k\left( x_i, x_j\right) \) with a positive-definite kernel function \(k:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\), where \([k_{XX}]_{ij}\) denotes the (ij) component of \(k_{XX}\). See Appendix for a detailed definition of the Gaussian process. The method for designing the kernel function k is discussed in Sect. 3.2.

We assume that the data set is given as follows. For N observation points \(X=\left[ x_1, \ldots , x_{N}\right] ^\top \in \mathcal {X}^N\) and a sample size of M per point, we assume that the data

$$\begin{aligned} U=\left[ u_1(x_1),\ldots ,u_M(x_1), \ldots , u_1(x_{N}),\ldots , u_M(x_{N})\right] ^\top \in \mathbb {R}^{MN}, \end{aligned}$$
(8)

which approximates the solution of the PDE, are obtained. We assume that these data satisfy

$$\begin{aligned} u_j(x_i)= v(0,x_i) + \xi _{i,j}, \quad i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\}, \end{aligned}$$
(9)

where \(\xi _{i,j}\ (i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\})\) are random variables that are independent and follow a normal distribution with mean 0 and variance \(r(x_i)\):

$$\begin{aligned} \xi _{i,j} \sim \mathcal {N}\left( 0, r(x_i)\right) , \quad i\in \{1, \ldots , N\},\ j\in \{1, \ldots , M\}. \end{aligned}$$
(10)

The function \(r:{\mathcal {X}}\rightarrow {\mathbb {R}}_{> 0}\) is a noise variance function. To reduce the effect of noise, we derive the posterior distribution using the sample mean \({\overline{u}}(x_i)\) for the data at the same observation point \(x_i\):

$$\begin{aligned} {\overline{U}} = \left[ {\overline{u}}(x_1),\ldots ,{\overline{u}}(x_N)\right] ^\top \in {\mathbb {R}}^N, \end{aligned}$$
(11)

where \(\overline{{u}} (x_i){:=}\sum _{j=1}^M {u}_j (x_i)/M\). Since the variance of the sample mean is the variance of each sample divided by M, (9) is rewritten as follows:

$$\begin{aligned} {\overline{u}}(x_i) = v(0,x_i) + {\overline{\xi }}_{i}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(12)

where

$$\begin{aligned} {\overline{\xi }}_{i}\sim \mathcal {N}\left( 0, \frac{r(x_i)}{M}\right) , \quad i\in \{1, \ldots , N\}. \end{aligned}$$
(13)

In summary, the data set is modeled as satisfying

$$\begin{aligned} p({\overline{U}}|v_X,X) = {\mathcal {N}}\left( v_X, \frac{r_{XX}}{M}\right) , \end{aligned}$$
(14)

where \(r_{X X}\in \mathbb {R}^{N \times N}\) is a noise variance matrix of the form \(\left[ r_{X X}\right] _{i j}=\delta _{ij} r(x_i)\). The methods for obtaining the data set \({\overline{U}}\) and the noise variance matrix \(r_{XX}\) are described in Sect. 3.2.

In GPR, the posterior distribution can be analytically obtained from the prior and the data set, as stated in the next proposition. Recalling that k is the kernel function representing the prior in (7), we define \(k_{X x}=k_{x X}^{\top }=\left[ k\left( x_1, x\right) , \ldots , k\left( x_{N}, x\right) \right] ^{\top }\).

Proposition 2

(Ref. [14]) Suppose that the matrix \(k_{X X}+r_{XX}/M\) is invertible. Suppose also that (7) and (14) hold. For any \(x\in {\mathcal {X}}\), the posterior distribution is calculated as

$$\begin{aligned} p(v(0,x)| X,{\overline{U}},x) = {\mathcal {N}}(\widetilde{u}(x), \widetilde{k}(x,x)), \end{aligned}$$
(15)

where

$$\begin{aligned} \widetilde{u}(x)&=k_{x X}\left( k_{X X}+\frac{r_{X X}}{M}\right) ^{-1}{\overline{U}}, \quad x \in \mathcal {X}, \end{aligned}$$
(16)
$$\begin{aligned} \widetilde{k}\left( x, x^{\prime }\right)&=k\left( x, x^{\prime }\right) -k_{x X}\left( k_{X X}+\frac{r_{XX}}{M}\right) ^{-1} k_{X x^{\prime }}, \quad x, x^{\prime } \in \mathcal {X}. \end{aligned}$$
(17)

Here, we estimate the solution of the PDE by evaluating the posterior mean function (16). We use the posterior variance \({\widetilde{\sigma }}^2(x) {:=}\widetilde{k}(x,x)\) to assess the uncertainty of the estimates. This value equals the mean squared error (MSE) between the true solution v(0, x) and the posterior mean, where the expectation is taken with respect to the posterior \(p(v(0,x)| X,{\overline{U}},x)\).
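
To make Proposition 2 concrete, the following is a minimal NumPy sketch of the posterior mean (16) and posterior variance (17) with the heteroscedastic noise matrix \(r_{XX}/M\); the function and variable names are illustrative and are not part of any reference implementation.

```python
import numpy as np

def gp_posterior(k, X, u_bar, r_diag, M, x_query):
    """Posterior mean (16) and variance (17) under heteroscedastic noise r_XX / M.

    k       : callable kernel, k(x, x') -> float (positive definite)
    X       : (N, d) array of observation points
    u_bar   : (N,) array of sample means, cf. (11)
    r_diag  : (N,) array of noise variances r(x_i)
    M       : FK sample size per observation point
    x_query : (Q, d) array of evaluation points
    """
    N = X.shape[0]
    K_XX = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    A = K_XX + np.diag(r_diag) / M                     # k_XX + r_XX / M
    K_qX = np.array([[k(xq, X[j]) for j in range(N)] for xq in x_query])
    mean = K_qX @ np.linalg.solve(A, u_bar)            # eq. (16)
    cross = np.linalg.solve(A, K_qX.T).T               # A^{-1} k_{X x} for each query point
    var = np.array([k(xq, xq) for xq in x_query]) - np.einsum("qi,qi->q", K_qX, cross)
    return mean, var                                   # eq. (17) with x = x'
```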

Remark 2

In standard GPR, the random variable \(\xi _{i,j}\) has constant variance \(r(x)\equiv \sigma ^2\), which is a hyperparameter learned to maximize the likelihood. As described in Sect. 3.2, in our method the noise variance is estimated as a function over the space \({\mathcal {X}}\) by using the sampled data set. Although the concept of heteroscedastic GPR has already been studied for many years [24, 25, 29, 30], to the best of our knowledge, this is the first time that it has been used for the numerical calculation of PDEs.

3.2 Prior and Data Set

In the GPR introduced in the previous section, the following design parameters have not yet been determined:

  1.

    The kernel function k in (7)

  2.

    The data set U and the noise variance matrix \(r_{XX}\) that satisfy (14)

We first discuss the kernel function k. We propose the use of the following Matérn kernel to reflect a priori knowledge about the regularity of the PDE solution:

$$\begin{aligned} k\left( x, x^{\prime }\right) =\frac{1}{2^{\alpha -1} \varGamma (\alpha )}\left( \frac{\sqrt{2 \alpha }\left\| x-x^{\prime }\right\| }{h}\right) ^\alpha K_\alpha \left( \frac{\sqrt{2 \alpha }\left\| x-x^{\prime }\right\| }{h}\right) , \end{aligned}$$
(18)

where \(\alpha >0\) and \(h>0\) are hyperparameters, \(\varGamma \) is the Gamma function, and \(K_\alpha \) is the modified Bessel function of the second kind of order \(\alpha \).

This choice is justified as follows. By the Moore–Aronszajn theorem [1], every positive-definite kernel k is associated with a unique reproducing kernel Hilbert space (RKHS) \(\mathcal {H}_k\). The norm of this RKHS quantifies the smoothness of functions. In particular, the RKHS of the Matérn kernel is known to be norm-equivalent to a Sobolev space:

Proposition 3

(RKHSs of Matérn kernels: Sobolev spaces [19, 49])

Let k be the Matérn kernel on \(\mathcal {X} \subseteq \mathbb {R}^d\). Suppose that \(s{:=}\alpha +d / 2\) is an integer. Then, the RKHS \(\mathcal {H}_{k}\) is norm-equivalent to the Sobolev space of order s.

Therefore, in situations where information about the regularity of the PDE solutions is known, the smoothness of the regression results can be made to match the smoothness of the true solution by appropriately choosing the parameter \(\alpha \) in the Matérn kernel.
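
As an illustration, the Matérn kernel (18) can be evaluated with the Gamma and modified Bessel functions available in SciPy; the following is a minimal sketch with illustrative names.

```python
import numpy as np
from scipy.special import gamma, kv  # Gamma function and modified Bessel K_alpha

def matern_kernel(x, x_prime, alpha, h):
    """Matern kernel (18); alpha = s - d/2 gives an RKHS equivalent to the Sobolev space of order s."""
    r = np.linalg.norm(np.asarray(x, float) - np.asarray(x_prime, float))
    if r == 0.0:
        return 1.0                                  # limiting value at coinciding points
    z = np.sqrt(2.0 * alpha) * r / h
    return 2.0 ** (1.0 - alpha) / gamma(alpha) * z ** alpha * kv(alpha, z)
```

For example, to target a Sobolev space of order 2 on a \({\tilde{d}}\)-dimensional domain, one would choose \(\alpha = 2 - {\tilde{d}}/2\) (cf. Sect. 5).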

Next, to observe the data that satisfy (9), we employ the Feynman-Kac formula. Using the coefficients of (1) and (2), we define a random variable

$$\begin{aligned} \begin{aligned} {\widehat{u}}(t,x) {:=}\,&\int _t^T h(s, X(s ; t, x)) e^{-\int _t^s c(r, X(r ; t, x)) \textrm{d}r} \textrm{d}s \\&+g(X(T ; t, x)) e^{-\int _t^T c(r, X(r ; t, x)) \textrm{d}r}, \end{aligned} \end{aligned}$$
(19)

where \(X(\cdot ) \equiv X(\cdot ; t, x)\) is a strong solution of the following stochastic differential equation:

$$\begin{aligned} \begin{aligned} \textrm{d}X(s)&=b(s, X(s)) \textrm{d}s+a(s, X(s)) \textrm{d}W(s), \quad s \in [t, T], \\ X(t)&=x, \end{aligned} \end{aligned}$$
(20)

where \((t, x) \in [0, T) \times \mathbb {R}^d\) and \(W(\cdot )\) is a standard m-dimensional Wiener process starting at \(W(t)=0\) at time t.

As stated below, the Feynman–Kac formula guarantees that \({\widehat{u}}\) appropriately approximates the solution v of the PDE. For any \((t,x)\in [0,T]\times {\mathbb {R}}^d\), let \({\widehat{\xi }}(t,x){:=}\widehat{u}(t,x)-v(t,x)\):

$$\begin{aligned} \begin{aligned} {\widehat{u}}(t,x) = v(t, x) + {{\widehat{\xi }}}(t,x),\quad (t, x) \in [0, T] \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(21)

where \(\widehat{u}\) is defined as (19) and v satisfies (1) and (2).

Proposition 4

(Feynman–Kac formula for Kolmogorov PDE [51]) Suppose that Assumption 1 holds. Then, \({{\widehat{\xi }}}(t,x)\) satisfies

$$\begin{aligned} {\mathbb {E}}[{{\widehat{\xi }}}(t,x)|t,x,v] = 0. \end{aligned}$$
(22)

In the classical Feynman–Kac formula, a nondegeneracy condition on the diffusion term is required for the existence of classical solutions; however, since we consider viscosity solutions here, this condition is unnecessary.

In this study, numerical computation of (19) is called Feynman–Kac (FK) sampling. Note that in the implementation of FK sampling, the computation of (20) requires a numerical calculation of the SDE using a method such as the Euler–Maruyama method, and the computation of (19) requires a numerical integration using a technique such as the trapezoidal rule.
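
As a rough illustration, a single FK sample (19) at a point (t, x) might be computed as in the following sketch, which combines an Euler–Maruyama discretization of (20) with trapezoidal integration; the coefficient functions b, a, c, h, g, the step count, and the helper names are assumptions for illustration and not the exact implementation used later.

```python
import numpy as np

def fk_sample(b, a, c, h, g, t, x, T, n_steps, rng):
    """One realization of the random variable (19) starting from (t, x)."""
    m = a(t, x).shape[1]
    dt = (T - t) / n_steps
    s, X = t, np.array(x, dtype=float)
    int_c = 0.0                                   # running \int_t^s c(r, X(r)) dr
    payoff = 0.0                                  # running \int_t^s h(...) e^{-int_c} ds
    f_prev, c_prev = h(s, X) * np.exp(-int_c), c(s, X)
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=m)
        X = X + b(s, X) * dt + a(s, X) @ dW       # Euler-Maruyama step for (20)
        s += dt
        c_new = c(s, X)
        int_c += 0.5 * (c_prev + c_new) * dt      # trapezoidal rule for the exponent
        f_new = h(s, X) * np.exp(-int_c)
        payoff += 0.5 * (f_prev + f_new) * dt     # trapezoidal rule for the h-integral
        c_prev, f_prev = c_new, f_new
    return payoff + g(X) * np.exp(-int_c)         # eq. (19)
```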

We describe how to use the data resulting from FK sampling as a dataset \({\overline{U}}\) and noise variance matrix \(r_{XX}\) that satisfy (14). For the data set \({\overline{U}}\) in (11), we use \({\check{U}}=[{\check{u}} (x_1),\ldots ,{\check{u}} (x_N)]^\top \), where \({\check{u}} (x_i)\) is the sample mean of M FK samples \(\{{\widehat{u}}_j(0,x_i)\}_{j=1}^M\):

$$\begin{aligned} {\check{u}}{(x_i)}{:=}\sum _{j=1}^M \frac{{{\widehat{u}}}_j (0,x_i)}{M}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(23)

where \({\widehat{u}}_j\) denotes the jth FK sample. It is also appropriate to use \(\check{r}_{XX}\) for the noise variance matrix \(r_{XX}\) in (14), which satisfies \([\check{r}_{XX}]_{ij}=\check{r}(x_i)\delta _{ij}\), where

$$\begin{aligned} \check{r}(x_i) {:=}\sum _{j=1}^M \frac{({\widehat{u}}_j(0,x_i) - \check{u}(x_i))^2}{M-1}, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(24)

is the unbiased sample variance of M FK samples. In summary, our dataset satisfies the following:

$$\begin{aligned} \check{u}(x_i) = v(0,x_i) + \check{\xi }_i, \quad i\in \{1, \ldots , N\}, \end{aligned}$$
(25)

where \(\check{\xi }_i{:=}\sum _{j=1}^M {{\widehat{\xi }}}_j(0,x_i)/M\) and \({{\widehat{\xi }}}_j(0,x_i) {:=}{\widehat{u}}_j(0,x_i) - v(0,x_i)\).
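
For illustration, the data (23) and the noise variances (24) could be assembled from repeated FK samples as in the following sketch; fk_sample_fn is assumed to wrap the coefficient functions (for example, a partial application of the FK sampling sketch above), and the names are illustrative.

```python
import numpy as np

def build_dataset(fk_sample_fn, X_obs, M, rng):
    """Sample means (23) and unbiased sample variances (24) at each observation point."""
    u_check = np.empty(len(X_obs))
    r_check = np.empty(len(X_obs))
    for i, x_i in enumerate(X_obs):
        samples = np.array([fk_sample_fn(0.0, x_i, rng) for _ in range(M)])
        u_check[i] = samples.mean()               # eq. (23)
        r_check[i] = samples.var(ddof=1)          # eq. (24)
    return u_check, r_check
```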

Remark 3

The proposed method allows one to choose an arbitrary low-dimensional bounded domain on which to solve the PDE. This is accomplished by restricting the FK sampling points to the low-dimensional bounded domain and then applying GPR only to those points. For example, in a d-dimensional problem, suppose that we focus on a bounded domain of \({\tilde{d}}\) dimensions and set the number of observation points per dimension to \({\tilde{N}}\). In this case, the most computationally intensive operation of the proposed method is the computation of the posterior variance in GPR, with a computational complexity of \(O({\tilde{N}}^{3{\tilde{d}}})\). On the other hand, FEM requires computations across all dimensions. Thus, when using direct solvers for dense linear systems, the computational complexity is \(O({\tilde{N}}^{3d})\), where \(\widetilde{N}\) denotes the degrees of freedom in the mesh per dimension. Therefore, when \({\tilde{d}} < d\), the proposed method requires much lower computational cost than FEM. This computational advantage is demonstrated in Sect. 5.

Remark 4

Because the noise term \(\check{\xi }\) in (25) is not normally distributed, (12) and (25) do not correspond directly. However, if the variance of \({{\widehat{\xi }}}\) is bounded and the sample size M is large enough, the central limit theorem guarantees that the random variable \(\check{\xi }\) approximately follows a normal distribution with mean 0 and variance \(\check{r}/M\) [18]. Because of this fact, we use (25) as an approximate model for (12). In addition, several results assess the non-normality of variables for finite sample sizes M [18, 28]. For example, the error of the distributions for finite M is estimated by the Berry–Esseen theorem [18], which states that the rate of convergence is \(O(M^{-1/2})\).

4 Posterior Variance Analysis

In this section, we carry out an a priori evaluation of the posterior variance in (17). We focus on the integral value of the MSE on the input space \({\mathcal {X}}\), called integrated MSE (IMSE):

$$\begin{aligned} \textrm{IMSE}{:=}\int _{{\mathcal {X}}} {\widetilde{\sigma }}^2(x) \textrm{d}\nu (x), \end{aligned}$$
(26)

where \(\nu \) is a given probability measure of the observation points whose support is \({\mathcal {X}}\).

Here, we aim to characterize the speed of the decay of IMSE with respect to an increase in the number of observation points N and the sample size M. To achieve this, we first characterize the decay speed using the eigenvalues of the kernel function and the magnitude of the noise term. Specifically, we derive a probabilistic lower bound for the IMSE and show that the IMSE is expected to decrease as N and M increase (see Theorem 1). Among the variables used in this lower bound, the eigenvalues can be considered known, but the magnitude of the noise term is unknown. We therefore probabilistically assess the error between the true value of the noise term and its estimate, and show that the evaluation of the IMSE is justified under sufficiently large sample size M (see Theorem 2). After deriving these theorems, we give a more conservative estimate of the IMSE, which does not include information on the eigenvalues of the kernel and the magnitude of the noise term (see Corollary 1).

Assumption 2

  1.

    The kernel function k is bounded and positive-definite on \({\mathcal {X}}\).

  2.

    The observation points \(X=[x_1,\ldots ,x_N]^\top \) are sampled from the known probability measure \(\nu \).

  3.

    The data set \({\overline{U}}\) in (11) satisfies (14).

  4.

    For the noise variance function r in (10), there exists a positive minimum value \(r_\textrm{min}=\min _{x\in {\mathcal {X}}} r(x)>0\).

Theorem 1

Suppose that Assumption 2 holds. Then, for any \(\epsilon >0\), there exists an integer \(\widetilde{N}(\epsilon )>0\) such that for any \(N>\widetilde{N}\) the following holds:

$$\begin{aligned} \textrm{Pr} \left( \textrm{IMSE} > (1-\epsilon ) L_\textrm{IMSE}\right) = 1, \end{aligned}$$
(27)

where

$$\begin{aligned} L_\textrm{IMSE}{:=}r_\textrm{min} \sum _{p \in I} \frac{\lambda _p}{r_\textrm{min} + NM \lambda _p}. \end{aligned}$$
(28)

Here, \(\{\lambda _p\}_{p\in I}\) denotes the eigenvalues of the kernel function k for the probability measure \(\nu \). The set \(I\subseteq {\mathbb {N}}\) denotes the index set of the eigenvalues of the kernel.

Remark 5

The provided lower bound helps estimate the minimum sample size necessary to achieve a predetermined level of IMSE. This characteristic contrasts with the upper bound, which is commonly provided as a sufficient condition to guarantee the required performance. In particular, the fastest order O(1/MN) provided in Equation (30) implies that to achieve a tenfold improvement in the current IMSE, it is necessary to increase the sample size (M or N) by a factor of ten. Understanding the lower bound of the IMSE thus aids in designing numerical computations.

Remark 6

The evaluation of lower bounds on the posterior variance has long been studied as an asymptotic analysis of Gaussian processes [41, 50]. To the best of our knowledge, however, existing results only address the case where the noise model in the GPR is homoscedastic. We extend these results to the heteroscedastic case.

Remark 7

In the right-hand side of (27), the actual decay speed depends on the decay rate of the eigenvalues of the kernel. For some smooth kernels, the decay rate of the eigenvalues has been evaluated under known measures of the observation points [39]. It is also possible to evaluate the eigenvalues numerically by performing a sampling routine (see Chapter 4 of [50]).
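
As a hedged illustration of this remark and of the bound (28), the eigenvalues can be approximated by sampling from the measure \(\nu \) and diagonalizing the resulting kernel matrix (cf. Chapter 4 of [50]), after which \(L_\textrm{IMSE}\) is evaluated directly; the sampler sample_from_nu and the other names below are assumptions.

```python
import numpy as np

def approx_eigenvalues(k, sample_from_nu, n_points, rng):
    """Approximate the eigenvalues of k w.r.t. nu by the eigenvalues of a Gram matrix / n_points."""
    Xs = sample_from_nu(n_points, rng)                          # points drawn from nu
    K = np.array([[k(xi, xj) for xj in Xs] for xi in Xs])
    lam = np.linalg.eigvalsh(K) / n_points
    return np.clip(lam, 0.0, None)

def imse_lower_bound(lam, r_min, N, M):
    """Lower bound L_IMSE of eq. (28)."""
    return r_min * np.sum(lam / (r_min + N * M * lam))
```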

To evaluate the right-hand side of Theorem 1, it is necessary to find the minimum \(r_{\min }\) of the noise variance function r of the GPR. However, as mentioned in Sect. 3.2, r can only be estimated pointwise using sample variance as in (24). Therefore, it is natural to use \({\overline{r}}_{\min } {:=}\overline{r}(x_{i^*})\ (i^* {:=}\mathop {\textrm{argmin}}\limits _{i\in \{1,\ldots ,N\}}\overline{r}(x_i))\) as an estimate of \(r_{\min }\), where \({\overline{r}}(x_i) = \sum _{j=1}^M (u_j(x_i)-{\overline{u}}(x_i))^2/(M-1)\) is the unbiased variance of (9). The following theorem probabilistically evaluates the error between \(r_{\min }\) and \({\overline{r}}_{\min }\).

Theorem 2

Suppose that Assumptions 2.2–2.4 hold. Then, for any N, M, \(\delta >0\), and \(\epsilon >\delta \),

$$\begin{aligned} \begin{aligned}&\textrm{Pr}(|{\overline{r}}_{\min } - r_{\min }|\ge \epsilon , \delta >|r(x_{i^*}) - r_{\min }|) \\&\le 1-\prod _{i=1}^N \max \left\{ 1-\frac{2 r^2\left( x_i\right) }{(\epsilon -\delta )^2(M-1)}, 0\right\} . \end{aligned} \end{aligned}$$
(29)

Remark 8

From (29), we can justify replacing \(r_{\min }\) in (28) with the minimum sample variance \({\overline{r}}_{\min }\) under a sufficiently large M relative to N.

Apart from the results above, it is also possible to evaluate the lower bound of the posterior variance in a form that does not include eigenvalues or noise variance. The following lower bound is more conservative than the one given in Theorem 1 but is more useful for an intuitive understanding.

Corollary 1

Suppose that Assumption 2 holds. Then, there exists a constant \(C>0\) such that for any \(N\in {\mathbb {N}}\) and \(M\in {\mathbb {N}}\),

$$\begin{aligned} L_\textrm{IMSE} \ge \frac{C}{NM}. \end{aligned}$$
(30)

Remark 9

From (30), we see that the fastest convergence rate of the posterior variance is \(O((NM)^{-1})\) for an FK sample size M and the number of GPR observation points N.

5 Numerical Calculation

In this section, we apply the proposed method to several PDEs and evaluate its performance. The target PDEs are the heat equation, the advection–diffusion equation, and the Hamilton–Jacobi–Bellman (HJB) equation. The goal is to learn the functional form of the solution on \(x\in [0,1]^d\) for \(d=10\) spatial dimensions. We learn only the first \(\widetilde{d}\) dimensions of the PDE solution, where \(\widetilde{d}\in \{1,2,3\}\). Specifically, we use N equally spaced grid points in the range [0, 1] for each of the first \(\widetilde{d}\) dimensions and fix the remaining \((d-\widetilde{d})\) coordinates to 0.5. In other words, we set \({\mathcal {X}}=[0, 1]\times \cdots \times [0, 1]\times \{0.5\}\times \cdots \times \{0.5\}\) and use the \(\widetilde{d}\)-dimensional Lebesgue measure as the probability measure \(\nu \). This restriction is made because using all \(N^d\) grid points as observation points would exhaust the machine's memory when allocating variables in GPR. The approximation methods proposed in [13] would alleviate this problem, but their implementation is beyond the scope of this paper.
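
For concreteness, the observation grid just described could be constructed as in the following sketch; the names are illustrative.

```python
import itertools
import numpy as np

def observation_grid(d=10, d_tilde=2, N=12):
    """N equally spaced points per estimated dimension on [0, 1]; remaining coordinates fixed at 0.5."""
    axis = np.linspace(0.0, 1.0, N)
    pts = []
    for coords in itertools.product(axis, repeat=d_tilde):
        x = np.full(d, 0.5)
        x[:d_tilde] = coords
        pts.append(x)
    return np.array(pts)                      # shape (N**d_tilde, d)
```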

We compare the performance of the following three estimation methods:

  • Heteroscedastic GPR (HGPR): The proposed method, in which the noise variance function in GPR is estimated via the sample variance of the FK samples as in (24).

  • Standard GPR: A method that treats \(r\equiv \sigma ^2\ (\sigma \in {\mathbb {R}})\) as a hyperparameter in GPR, where the hyperparameters are determined to maximize the marginal likelihood.

  • Linear interpolation: A method to linearly interpolate the function value \(\{{\overline{u}}(x_i)\}_{i=1}^N\) obtained by FK sampling.

We evaluate each method using the following criteria:

  1.

    Squared \(L^2\)-norm error of the estimates for the test data:

    $$\begin{aligned} \textrm{Error} = \int _{\mathcal {X}} \left[ \widetilde{u}(x) - v(x)\right] ^2 \textrm{d}x, \end{aligned}$$
    (31)

    where \(\widetilde{u}(x)\) is the estimate of the solution of the PDE and v(x) denotes the test data approximating the true solution v(0, x) of the PDE (1) and (2). The numerical integration in (31) is carried out with the trapezoidal rule.

  2.

    Integrated mean squared error (IMSE) defined in (26), along with its bound in (28). Here, to evaluate the bound in (28), the eigenvalues are approximated using a sampling routine, as described in Chapter 4 of [50]. The term \(r_{\min }\) is approximated by \(\check{r}_{\min }=\min _{i} \check{r}(x_i)\), where \(\{\check{r}(x_i)\}_{i=1}^N\) are the sample variances of the FK samples (19).

The first criterion evaluates the accuracy of the estimate, while the second criterion evaluates the uncertainty. We explain the method of creating test data in each subsection. Note that only GPR-based methods can calculate the IMSE.

In GPR, the hyperparameters \(\alpha \) and h of the Matérn kernel in (18) need to be determined. We determine \(\alpha \) to correspond to a Sobolev space of order 2, taking into account that we seek to solve second-order PDEs. The hyperparameter h is determined to maximize the marginal log-likelihood of the data:

$$\begin{aligned} \begin{aligned} h^*&= \mathop {\textrm{argmax}}\limits _{h} \log p({\overline{U}}|X,h) \\&= \mathop {\textrm{argmax}}\limits _{h} \log {\mathcal {N}}({\overline{U}}|0, k_{X X}(h)), \end{aligned} \end{aligned}$$
(32)

where \(k_{X X}(h)\) is the Matérn kernel matrix with the hyperparameter h. This maximization is performed by gradient ascent on the log-likelihood with respect to the hyperparameter. Since the log-likelihood is a non-convex function, the optimization converges to a local solution depending on the choice of initial values (see Chapter 5 in [50]). To search for a better local solution, 27 initial values of the hyperparameter are randomly chosen, and the one with the largest likelihood is adopted. Functions built into gpytorch are used for optimizing hyperparameters and for computing posterior distributions and kernel functions [12].
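
As a rough sketch of this setup, the HGPR model can be expressed in gpytorch with a Matérn kernel and a fixed pointwise noise variance \(\check{r}(x_i)/M\); the restart loop below is a simplified stand-in for the actual procedure, and the tensor names (train_x, u_bar, r_check) are assumptions. Here nu=1.5 corresponds to a Sobolev space of order 2 for \({\tilde{d}}=1\) via Proposition 3.

```python
import torch
import gpytorch

class HGPRModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.MaternKernel(nu=1.5)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

def fit_hgpr(train_x, u_bar, r_check, M, n_restarts=27, n_iters=100):
    best_model, best_loss = None, float("inf")
    for _ in range(n_restarts):
        likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
            noise=r_check / M, learn_additional_noise=False)   # heteroscedastic noise r_XX / M
        model = HGPRModel(train_x, u_bar, likelihood)
        model.covar_module.lengthscale = torch.rand(1) + 0.1   # random initial value of h
        model.train(); likelihood.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        for _ in range(n_iters):
            optimizer.zero_grad()
            loss = -mll(model(train_x), u_bar)                 # negative marginal log-likelihood, cf. (32)
            loss.backward()
            optimizer.step()
        if loss.item() < best_loss:
            best_loss, best_model = loss.item(), model
    return best_model
```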

Since the matrix \(k_{XX}\) is positive semidefinite and \(r_{XX}\) is a positive diagonal matrix, as assumed in Assumption 2, the inverse matrix in (16) exists. However, when the noise variance function r is small or the sample size M is extremely large, the condition number of the matrix may deteriorate, potentially leading to instability in computing the inverse. In such cases, the computation can be stabilized by adding a homoscedastic noise term. In our numerical computations, however, the calculations remained stable without any additional noise term.

In FK sampling, the Euler–Maruyama method is used to numerically solve the SDE (20) at each observation point, and the trapezoidal rule is used for the numerical integration in (19). The results of FK sampling depend on the sample paths of the Brownian motion. To suppress the effect of this randomness, in each experiment we generated FK samples using multiple random seeds and performed GPR with each of them. Specifically, for \({\tilde{d}} = 1\), we compared the average error and IMSE over 50 different random seeds; for \({\tilde{d}} = 2\), over 10 different random seeds; and for \({\tilde{d}} = 3\), over 5 different random seeds.

Before discussing the results for each PDE, we give a summary of the results that are common to all experiments. First, regarding the squared \(L^2\)-norm error, the error tends to become smaller as M and N are increased for all three methods. The errors for HGPR and GPR are comparable, and these two methods achieve smaller errors than linear interpolation. With respect to the IMSE, both HGPR and GPR achieve smaller IMSE for larger N and M. Comparing the two, HGPR achieves a smaller IMSE overall, suggesting that the proposed method achieves estimation with lower uncertainty.

5.1 Heat Equation

We consider the following initial value problem for the heat equation:

$$\begin{aligned}&\begin{aligned} \partial _t w(t,x)&=\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} w(t, x)\right\} , \end{aligned} \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(33)
$$\begin{aligned}&w(0,x)=\exp \left( -\lambda (x-0.5)^{\top } (x-0.5)\right) , \qquad \qquad x \in \mathbb {R}^d. \end{aligned}$$
(34)

We set \(a \equiv \sigma I_d\) with \(\sigma =0.4\), \(\lambda = 5.0\), and \(T=1.0\). To solve the initial value problem, the proposed method is applied after using the variable transformation described in Remark 1.

The solution of this Cauchy problem is given as

$$\begin{aligned} \begin{aligned} w(t,x)&= \frac{1}{(4\lambda \alpha )^{d/2}} \exp \left( -\frac{(x-0.5)^{\top } (x-0.5)}{4\alpha }\right) , \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(35)

where \(\alpha = {1}/{(4\lambda )} + \sigma ^2 t/2\). We evaluate the accuracy of the estimation results using this analytical solution as test data.

Fig. 1

Estimation results of the heat equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. In the GPR-based methods, the posterior variance is displayed as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\), that is, we estimate the solution along the first of the ten dimensions. Figure 1 shows the estimation results for the first dimension of the solution to the 10-dimensional PDE. Here, we used an FK sample size of \(M = 800\) and \(N = 12\) observation points. In this figure, HGPR provides estimation results that are closest to the analytical solution and also exhibits the smallest uncertainty in the estimates.

Fig. 2

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=20\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

We examine the behavior of the squared \(L^2\)-norm error and the IMSE with respect to the sample size M and the number of observation points N. Here, we set \(N=20\) when varying M and \(M=800\) when varying N. The results are shown in Fig. 2. We see that increasing M or N tends to decrease the \(L^2\)-norm error and the IMSE for each method. Regarding the \(L^2\)-norm error, all methods decrease the error as the sample size M is increased. This shows that the proposed method approximates the solution of the PDE more accurately as these parameters are increased. The errors for HGPR and GPR are comparable, while the errors for linear interpolation are larger. As the number of observation points N is increased, linear interpolation retains a larger error, while HGPR and GPR achieve smaller errors. In the region where N is small, HGPR reduces the error more than GPR. For both HGPR and GPR, the IMSE decreases monotonically with increasing M. For all M, HGPR achieves a smaller IMSE than GPR. When N is increased, the IMSE decreases in both HGPR and GPR, and the value for HGPR is smaller than that for GPR. The approximated IMSE bound is found to decay on a similar order of magnitude to the measured values.

Fig. 3

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results of estimating the first and second components out of the ten dimensions, i.e., \(\widetilde{d}=2\). We set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 3. Note that in the figure, the horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^2\). As N and M increase, the error and the IMSE of each method generally decrease. For the GPR-based methods, however, the IMSE levels off as N increases. In many parameter regions, HGPR outperforms the other methods in terms of both error and IMSE.

Fig. 4

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional heat equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results of estimating the first to third components out of the ten dimensions, that is, \(\widetilde{d}=3\). We set \(N = 12^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 4. The horizontal axis N represents the number of observation points per dimension, resulting in a total of \(N^3\) observation points. The behavior of the errors and the IMSE with respect to M shares the same characteristics as in Fig. 3. As N is increased, the error improves for each of the three methods, and the IMSE decreases for both the numerical solutions and the lower bound.

Summarizing the results for both criteria, HGPR is better than linear interpolation but similar to GPR in terms of error, and better than GPR in terms of uncertainty.

5.2 Advection–Diffusion Equation

Next, we consider the following initial value problem for the advection–diffusion equation:

$$\begin{aligned}&\begin{aligned} \partial _t w(t,x)&=\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} w(t, x)\right\} - b^\top \partial _x w(t, x), \end{aligned} \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(36)
$$\begin{aligned}&w(0,x)=\exp \left( -\lambda (x-0.5)^{\top } (x-0.5)\right) , \qquad \qquad \quad x \in \mathbb {R}^d, \end{aligned}$$
(37)

where we have defined \(a \equiv \sigma I_d\) with \(\sigma =0.4\), \(b=0.01\), \(\lambda =5.0\), and \(T=1.0\). As in the previous subsection, the proposed method is applied after using the variable transformation described in Remark 1.

The analytic solution of this Cauchy problem is given as

$$\begin{aligned} \begin{aligned} w(t,x)&= \frac{1}{(4\lambda \alpha )^{d/2}} \exp \left( -\frac{(x-0.5 + b t)^{\top } (x-0.5 + b t)}{4\alpha }\right) , \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned} \end{aligned}$$
(38)

where \(\alpha = {1}/{(4\lambda )} + \sigma ^2 t/2\). This analytical solution is used as test data to evaluate the accuracy of the estimation results.

Fig. 5

Estimation results of the advection–diffusion equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. For the GPR-based methods, the posterior variance is illustrated as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\). Figure 5 presents the estimation results for the first component of the solution to the 10-dimensional PDE. We use an FK sample size of \(M = 200\) and \(N = 12\) observation points. The estimation results of HGPR and GPR show negligible differences in terms of both posterior mean and posterior variance. Moreover, the results from both methods exhibit smaller errors than those obtained by linear interpolation.

Fig. 6

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, we set the number of observation points N as \(N=20\), and when varying the number of observation points N, we set the sample size M as \(M=800\) (Color figure online)

The squared \(L^2\)-norm error and IMSE when varying M and N are plotted in Fig. 6. Similar to the example above, increasing M or N in each method tends to decrease the \(L^2\)-norm error and the IMSE. All methods decrease the \(L^2\)-norm error as the sample size M is increased. In many parameter regions, the errors for HGPR and GPR are smaller than those for linear interpolation. For both HGPR and GPR, the IMSE decreases monotonically with increasing M. For all M, HGPR achieves a smaller IMSE than GPR. When N is increased, the IMSE of both HGPR and GPR decreases. In many regions, HGPR achieves a lower IMSE than GPR. The approximately calculated IMSE bound decays on a similar order of magnitude to the measured values.

Fig. 7

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results for the case where \(\widetilde{d}=2\). Here, we set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 7. The horizontal axis N represents the number of observation points per dimension, resulting in an actual number of observation points of \(N^2\). Similar to the case of the two-dimensional heat equation, HGPR exhibits better performance than the other two methods in many parameter regions.

Fig. 8

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional advection–diffusion equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results for the case where \(\widetilde{d}=3\). Here, we set \(N = 12^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 8. The horizontal axis N represents the number of observation points per dimension, resulting in an actual number of observation points of \(N^3\). These results basically share common characteristics with the estimation results in Fig. 4.

5.3 Hamilton–Jacobi–Bellman Equation

Finally, we consider the following terminal value problem for the Hamilton–Jacobi–Bellman (HJB) equation:

$$\begin{aligned} -\partial _t v(t, x)&= \ell (x) -\frac{1}{2} \partial _x v(t, x)^{\top } B R^{-1} B^{\top } \partial _x v(t, x) + \frac{1}{2}\operatorname {tr}\left\{ aa^\top \partial _{x x} v(t, x)\right\} ,\nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(39)
$$\begin{aligned} v(T, x)&=g(x),\qquad \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(40)

where we have defined \(\ell (x) = (x-0.5\mathbbm {1}_d)^\top Q (x-0.5\mathbbm {1}_d)/2\) with \(Q=I_d\), \(g(x)\equiv 0\), \(B=I_d\), \(R=I_d\), \(a \equiv \sigma I_d\) with \(\sigma =0.4\), and \(T=1.0\). This PDE corresponds to the following optimal control problem:

$$\begin{aligned}&\text {minimize } J(\alpha ) = \mathbb {E}\left\{ \int _{0}^{T} \ell (X(s)) + \frac{1}{2}\alpha (s)^\top R \alpha (s) \textrm{d} s + g (X(T))\right\} , \end{aligned}$$
(41)
$$\begin{aligned}&\text {subject to } \textrm{d}X (s)=B\alpha (s)\textrm{d}s + a \textrm{d}W(s), \ \ s \in [0, T], \end{aligned}$$
(42)
$$\begin{aligned}&\qquad \qquad \quad X(0) = x. \end{aligned}$$
(43)

More specifically, the solution \(\alpha ^*\) of this optimal control problem is represented as follows:

$$\begin{aligned} \alpha ^*(t,x) = -R^{-1} B^\top \partial _x v(t,x). \end{aligned}$$
(44)

Since the HJB equation contains nonlinear terms, the proposed method is not directly applicable. Thus, we use the Cole–Hopf transformation to linearize the equation [20]. First, we find a real number \(\lambda \) that satisfies \(\lambda BR^{-1}B^\top = a a^\top \). Using this, we transform the PDE with \(\widetilde{v}=\exp (-v/\lambda )\):

$$\begin{aligned}&\begin{aligned} \partial _t \widetilde{v}(t,x)&=-\frac{1}{2} \operatorname {tr}\left\{ a a^{\top } \partial _{xx} \widetilde{v}(t, x)\right\} + \frac{1}{\lambda } \ell (x) \widetilde{v}(t,x), \end{aligned} (t, x) \in [0, T) \times \mathbb {R}^d, \end{aligned}$$
(45)
$$\begin{aligned}&\widetilde{v}(T,x)=\exp \left( -\frac{g(x)}{\lambda }\right) , \qquad \qquad \qquad \qquad \qquad \qquad x \in \mathbb {R}^d, \end{aligned}$$
(46)

To apply the proposed method, we first conduct FK sampling on the linearized PDE. Next, we apply the inverse Cole–Hopf transformation to the obtained pointwise data. Finally, we carry out regression with each method on the inverse-transformed data. Such transformations may violate the assumptions on the distribution of the data and the noise stated in Assumption 2.3. However, our numerical experiments below demonstrate that the method nevertheless performs well in practice.
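
In the present setting with \(B=R=I_d\) and \(a=\sigma I_d\), the constant reduces to \(\lambda =\sigma ^2\), and the inverse transformation applied to the FK samples is simply \(v=-\lambda \log \widetilde{v}\); a minimal sketch, with illustrative names, follows.

```python
import numpy as np

sigma = 0.4
lam = sigma ** 2                       # lambda B R^{-1} B^T = a a^T with B = R = I_d, a = sigma I_d

def inverse_cole_hopf(v_tilde_samples):
    """Pointwise inverse Cole-Hopf transform applied to FK samples before regression."""
    return -lam * np.log(np.asarray(v_tilde_samples))
```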

The terminal value problem (39) and (40) does not have a closed-form analytical solution. However, the solution can be characterized through the solution of a Riccati ordinary differential equation (ODE). The terminal value problem for the Riccati equation is defined as follows:

$$\begin{aligned} \dot{P}(t)&= P(t)BR^{-1}B^\top P(t) - Q, \end{aligned}$$
(47)
$$\begin{aligned} P(T)&= 0. \end{aligned}$$
(48)

Using the solution P(t), the solution of the HJB equation is expressed as follows [10]:

$$\begin{aligned} v(t,x) = \frac{1}{2}(x-0.5\mathbbm {1}_d)^\top P(t) (x-0.5\mathbbm {1}_d) + \frac{1}{2}\int _{t}^T \operatorname {tr} \left\{ a a^\top P(s)\right\} \textrm{d}s. \end{aligned}$$
(49)

Therefore, by numerically solving this ODE, it is possible to calculate the solution to the PDE. We compare the errors between the numerical solutions obtained by the proposed method and the numerical solution of the ODE. For the discretization of the ODE, the Euler method with a time step of \(\varDelta t = 0.01\) is used.
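
For reference, a minimal sketch of this backward-in-time Euler integration of the Riccati problem (47) and (48) is given below; the names are illustrative.

```python
import numpy as np

def solve_riccati(d, T=1.0, dt=0.01):
    """Integrate (47) backward from the terminal condition P(T) = 0 with the explicit Euler method."""
    Q, B, R_inv = np.eye(d), np.eye(d), np.eye(d)   # Q = B = R = I_d in this experiment
    n_steps = int(round(T / dt))
    P = np.zeros((d, d))                            # P(T) = 0, eq. (48)
    traj = [P.copy()]                               # stored from t = T down to t = 0
    for _ in range(n_steps):
        P_dot = P @ B @ R_inv @ B.T @ P - Q         # eq. (47)
        P = P - dt * P_dot                          # one Euler step backward in time
        traj.append(P.copy())
    return traj[::-1]                               # [P(0), P(dt), ..., P(T)]
```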

Fig. 9

Estimation results of the HJB equation. The blue solid line represents HGPR, the orange dash-dotted line represents GPR, the green dotted line represents linear interpolation, and the red dashed line represents the analytical solution. In the GPR-based methods, the posterior variance is illustrated as a shaded region (Color figure online)

First, we consider the case with \(\widetilde{d}=1\). Figure 9 shows the estimation results for the first component of the solution to the 10-dimensional PDE. Figure 9a presents the estimation results for the solution, while Fig. 9b illustrates the estimation results for the spatial derivatives of the solution. We use an FK sample size of \(M = 100\) and \(N = 12\) observation points. We compute the spatial derivatives of the solution since they are needed to calculate the optimal control input in (44). In the GPR-based methods, the derivatives of the posterior mean are obtained analytically by differentiating the kernel function [48]. We perform these calculations using the automatic differentiation feature of pytorch [33]. For linear interpolation, the derivatives are approximated by finite differences of the FK sample means at adjacent observation points. Regarding the estimation results, the GPR-based methods provide estimates that are close to the test data and exhibit small uncertainty. Furthermore, the GPR-based methods approximate the derivative values accurately.
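
As a hedged sketch of this differentiation step, the gradient of the GP posterior mean with respect to the input can be obtained with torch autograd, assuming a fitted gpytorch model (for example, from the earlier fitting sketch) is available; the function name is illustrative.

```python
import torch

def posterior_mean_gradient(model, x):
    """Gradient of the posterior mean at x, used for the optimal control input (44)."""
    model.eval()
    x = x.clone().requires_grad_(True)            # shape (1, d)
    mean = model(x).mean.sum()                    # posterior mean (16) evaluated at x
    grad, = torch.autograd.grad(mean, x)          # d(posterior mean)/dx
    return grad
```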

Fig. 10

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation result of the first out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, we set the number of observation points N as \(N=20\), and when varying the number of observation points N, we set the sample size M as \(M=800\) (Color figure online)

The squared \(L^2\)-norm error and IMSE when varying M and N are plotted in Fig. 10. Similar to the two examples above, increasing M or N in each method tends to decrease the \(L^2\)-norm error and the IMSE. All methods decrease the \(L^2\)-norm error as the sample size M is increased. The values for HGPR and GPR are comparable, while the values for linear interpolation are larger. When the number of observation points N is increased, linear interpolation retains a larger error, while HGPR and GPR achieve smaller errors. HGPR reduces the error more than GPR, particularly in the region where N is small. Next, for both HGPR and GPR, the IMSE decreases monotonically with increasing M, and for all M, HGPR achieves a smaller IMSE than GPR. As N is increased, the IMSE of HGPR decreases more than that of GPR. The IMSE bound is tighter here than in the two examples above. In the region where N is small, the IMSE of GPR is close to the estimated lower bound. Note that the lower bound given in Theorem 1 is a probabilistic inequality that holds when N is sufficiently large and does not necessarily hold for small N.

Fig. 11

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation result of the first and second out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=12^2\), and when varying the number of observation points N, the sample size M is set as \(M=1600\) (Color figure online)

Next, we present the results for the case where \(\widetilde{d}=2\). Here, we set \(N = 12^2\) when varying M and \(M = 1600\) when varying N. The results are shown in Fig. 11. The horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^2\). As N and M increase, the error decreases for each method. The IMSE becomes smaller as M is increased, but not as N is increased. The GPR-based methods achieve smaller errors than linear interpolation; however, HGPR and GPR exhibit similar performance aside from a slight difference in IMSE.

Fig. 12

Squared \(L^2\)-norm error and IMSE when each method is applied to the 10-dimensional HJB equation. The estimation of the first, second, and third dimensions out of the ten dimensions are displayed. Blue (solid): HGPR / Orange (dash-dot): GPR / Green (dotted): linear interpolation / Red (dashed): lower bound of IMSE in HGPR. The error bars represent the standard error of the calculation results for 50 different seeds used in the FK sampling. When varying the sample size M, the number of observation points N is set as \(N=8^3\), and when varying the number of observation points N, the sample size M is set as \(M=800\) (Color figure online)

Finally, we present the results for the case where \(\widetilde{d}=3\). Here, we set \(N = 8^3\) when varying M and \(M = 800\) when varying N. The results are shown in Fig. 12. The horizontal axis N represents the number of observation points per dimension, so that the actual number of observation points is \(N^3\). The behavior of the errors and the IMSE with respect to M shares the same characteristics as in Fig. 11, while their behavior with respect to N has features in common with Fig. 4.

6 Conclusion

We have proposed a numerical method for Kolmogorov PDEs that combines heteroscedastic Gaussian process regression with the Feynman–Kac formula. The proposed method can assess the uncertainty in the numerical results and can compute the solution on an arbitrary low-dimensional bounded domain. The quality of the solution is improved by designing the kernel function and by incorporating the noise information obtained from Monte Carlo samples into the GPR noise model. We have also derived a lower bound on the posterior variance, which is useful for estimating the best-case performance of the numerical calculations. The numerical results confirm that the proposed method provides better approximations over wide parameter ranges than naive linear interpolation and standard GPR.

In future work, we will explore iterative observation schemes that leverage uncertainty information. Active learning of observation points for efficient function approximation and extremum search has been widely investigated [15, 37, 40]. In line with these previous studies, we expect to estimate the solution of the PDE efficiently by adaptively selecting sampling points. We will also investigate the design of uncertainty-aware acquisition functions to guide the search for regions of interest, such as those that maximize or minimize a specific physical quantity.