Missing components in incomplete instances prevent a kernel-based model from using the partially observed components of those instances to compute kernels, including the Gaussian kernel, which is widely used in machine learning modeling and applications. Existing methods for computing Gaussian kernels with incomplete data, however, assume independence among variables. In this study, we propose a new method, the expected Gaussian kernel with correlated variables, that estimates the Gaussian kernel from incomplete data while accounting for the correlation among variables. In the proposed method, the squared distance between two instance vectors is modeled as the sum of the correlated squared unit-dimensional distances between the instances, and the Gaussian kernel with missing values is obtained by estimating the expected Gaussian kernel function under the probability distribution of the squared distance between the vectors. The proposed method is evaluated on synthetic data, on real-life benchmark data, and on a case from a multi-pattern photolithographic process for wafer fabrication in semiconductor manufacturing. The experimental results show that the proposed method improves the estimation of Gaussian kernels for incomplete data with correlated variables.

This research was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (MSIT) of Korea (No. RS-2023-00208412).
Appendix A: Proof of Proposition 1
For the squared distance between two real vectors \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), a Gamma variable \(\zeta_{ij}\) is approximated from the sum of correlated Gamma variables \(\gamma_{ijp} \sim Gamma\left( {k_{ijp} ,\theta_{ijp} } \right)\) for \(p = 1, \ldots, D\), based on the approximation in Feng et al. (2016). The shape parameter \(k_{ijp}\), which is estimated using \(E\left[ {\gamma_{ijp} } \right]\) in (19) and \(Var\left( {\gamma_{ijp} } \right)\) in (20) from the moments of the missing components in the original space, satisfies the condition \(k_{ijp} \ge \frac{1}{2}\) if \(\sigma_{pp,i} + \sigma_{pp,j} > 0\):
Similarly, the scale parameter \(\theta_{ijp}\) satisfies the condition \(\theta_{ijp} > 0\) if \(\sigma_{pp,i} + \sigma_{pp,j} > 0\):
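As a numerical illustration of the two conditions in this appendix, the following minimal sketch computes moment-matched Gamma parameters for a single squared unit-dimensional distance. It assumes, as in Appendix B, that the two instances are independent and that each component is Gaussian, so that the difference \(X_{ip} - X_{jp}\) is normal with variance \(\sigma_{pp,i} + \sigma_{pp,j}\). The function name `gamma_params` and the moment-matching choice \(k = E[\gamma]^{2}/Var(\gamma)\), \(\theta = Var(\gamma)/E[\gamma]\) are assumptions made for illustration and may differ in form from (19) and (20).

```python
def gamma_params(mean_i, mean_j, var_i, var_j):
    """Moment-matched Gamma(k, theta) for gamma = (X_ip - X_jp)**2.

    Assumption (not taken from the paper's equations): X_ip ~ N(mean_i, var_i)
    and X_jp ~ N(mean_j, var_j) are independent. An observed component is
    represented by its value with variance 0.
    """
    s2 = var_i + var_j                 # sigma_pp,i + sigma_pp,j
    if s2 <= 0:
        # Both components observed: the distance is deterministic, no Gamma fit.
        raise ValueError("sigma_pp,i + sigma_pp,j must be positive")
    m2 = (mean_i - mean_j) ** 2        # squared difference of the means
    mean = m2 + s2                     # E[gamma] for a squared normal variable
    var = 2.0 * s2**2 + 4.0 * m2 * s2  # Var(gamma) for a squared normal variable
    k = mean**2 / var                  # shape: equals 1/2 + m2**2 / var >= 1/2
    theta = var / mean                 # scale: positive whenever s2 > 0
    return k, theta


if __name__ == "__main__":
    # Hypothetical inputs: (mean_i, mean_j, var_i, var_j)
    for case in [(1.0, 1.0, 0.5, 0.5), (2.0, 0.0, 1.0, 0.0), (0.3, -1.2, 0.1, 2.0)]:
        k, theta = gamma_params(*case)
        print(case, "k =", k, "theta =", theta)
```

Under these assumptions the shape parameter can be written as \(k = \frac{1}{2} + m^{4}/Var(\gamma)\), where \(m\) is the difference of the means, which makes the lower bound of \(\frac{1}{2}\) visible: it is attained exactly when the two means coincide.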
Appendix B: Covariance between squared unit-dimensional distances
Under the assumption of independence between two instances \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), the covariance between \(\gamma_{ijp}\) and \(\gamma_{ijq}\) in (20) can be rewritten as
where the two terms in (20) are
To compute the high-order moments of the \(i\)-th instance in (B.1), let \({\mathbf{x}}_{i(pq)} = \left[ {X_{ip} ,X_{iq} } \right]^{{\text{T}}}\) be a bivariate normal random vector, formed from a subset of the variables in \({\mathbf{x}}_{i}\), with mean \({\tilde{\mathbf{x}}}_{{i\left( {pq} \right)}} = \left[ {\tilde{x}_{ip} , \tilde{x}_{iq} } \right]^{{\text{T}}}\) and covariance matrix \({\tilde{\mathbf{S}}}_{{i\left( {pq} \right)}} = \left[ {\begin{array}{*{20}c} {\sigma_{pp,i} } & {\sigma_{pq,i} } \\ {\sigma_{pq,i} } & {\sigma_{qq,i} } \\ \end{array} } \right]\). Let \(M\left( {\mathbf{t}} \right)\) be the moment generating function of \({\mathbf{x}}_{{i\left( {pq} \right)}}\) with the argument vector \({\mathbf{t}} = \left[ {t_{p} ,t_{q} } \right]^{{\text{T}}}\), defined as
High-order raw cross moments of \({\mathbf{x}}_{{i\left( {pq} \right)}}\) are given by
and, accordingly, we have
From (B.4) to (B.7), the covariance in (B.1) becomes
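The covariance between two squared unit-dimensional distances can be checked numerically with a short sketch. It relies on a standard identity for jointly normal variables: if \(d_{p}\) and \(d_{q}\) have means \(\mu_{p}\), \(\mu_{q}\) and covariance \(s_{pq}\), then \(Cov(d_{p}^{2}, d_{q}^{2}) = 2s_{pq}^{2} + 4\mu_{p}\mu_{q}s_{pq}\). Under independence between the two instances, the differences \(d_{p} = X_{ip} - X_{jp}\) and \(d_{q} = X_{iq} - X_{jq}\) have \(\mu_{p} = \tilde{x}_{ip} - \tilde{x}_{jp}\), \(\mu_{q} = \tilde{x}_{iq} - \tilde{x}_{jq}\), and \(s_{pq} = \sigma_{pq,i} + \sigma_{pq,j}\). The function names and test values below are hypothetical, and the closed form is written in terms of the difference vector, so it is not necessarily arranged in the same way as the expression derived from (B.4) to (B.7).

```python
import math
import random


def cov_sq_dist(mu_p, mu_q, s_pq):
    """Cov(d_p**2, d_q**2) for jointly normal d_p, d_q.

    mu_p, mu_q: means of the two differences; s_pq: their covariance.
    """
    return 2.0 * s_pq**2 + 4.0 * mu_p * mu_q * s_pq


def cov_sq_dist_instances(mean_i, mean_j, sigma_pq_i, sigma_pq_j):
    """Same quantity from instance-level moments, assuming independent instances.

    mean_i, mean_j: (p, q) component means of instances i and j.
    sigma_pq_i, sigma_pq_j: within-instance covariances between components p and q.
    """
    mu_p = mean_i[0] - mean_j[0]
    mu_q = mean_i[1] - mean_j[1]
    return cov_sq_dist(mu_p, mu_q, sigma_pq_i + sigma_pq_j)


def monte_carlo_cov(mu_p, mu_q, s_pp, s_qq, s_pq, n=200_000, seed=0):
    """Sample-based estimate of Cov(d_p**2, d_q**2) as a sanity check."""
    rng = random.Random(seed)
    # 2x2 Cholesky factor of [[s_pp, s_pq], [s_pq, s_qq]]
    a = math.sqrt(s_pp)
    b = s_pq / a
    c = math.sqrt(s_qq - b * b)
    sum_a = sum_b = sum_ab = 0.0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        g_p = (mu_p + a * z1) ** 2           # squared distance in dimension p
        g_q = (mu_q + b * z1 + c * z2) ** 2  # squared distance in dimension q
        sum_a += g_p
        sum_b += g_q
        sum_ab += g_p * g_q
    return sum_ab / n - (sum_a / n) * (sum_b / n)


if __name__ == "__main__":
    exact = cov_sq_dist(1.0, 0.5, 0.8)
    approx = monte_carlo_cov(1.0, 0.5, 1.0, 2.0, 0.8)
    print("closed form:", exact, "Monte Carlo:", approx)
```

Setting \(p = q\) in the identity recovers the variance of a single squared distance, \(2s_{pp}^{2} + 4\mu_{p}^{2}s_{pp}\), and setting \(s_{pq} = 0\) gives zero covariance, which corresponds to the independence assumption used by earlier methods.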
