Abstract
Consider the regression problem with a response variable Y and a feature vector X. For the regression function m(x) = E{Y ∣ X = x}, we introduce new and simple estimators of the minimum mean squared error \({L}^{{\ast}} = \mathbf{E}\{{(Y - m(\mathbf{X}))}^{2}\}\) and prove their strong consistency. We also bound the rate of convergence.
Keywords
- Minimum Mean
- Strong Universal Consistency
- Regression Function
- Noiseless Observations
- Martingale Difference
1 Introduction
Let the label Y be a real-valued random variable and let the feature vector \(\mathbf{X} = (X_{1},\ldots,X_{d})\) be a d-dimensional random vector. The regression function m is defined by
$$\displaystyle{m(\mathbf{x}) = \mathbf{E}\{Y \mid \mathbf{X} = \mathbf{x}\}.}$$
The minimum mean squared error, also called the variance of the residual Y − m(X), is denoted by
$$\displaystyle{{L}^{{\ast}} = \mathbf{E}\{{(Y - m(\mathbf{X}))}^{2}\}.}$$
The regression function m and the minimum mean squared error L ∗ cannot be calculated when the distribution of (X, Y ) is unknown. Assume, however, that we observe data \(D_{n} =\{ (\mathbf{X}_{1},Y _{1}),\ldots,(\mathbf{X}_{n},Y _{n})\}\) consisting of independent and identically distributed copies of (X, Y ). D n can be used to produce an estimate of L ∗.
For nonparametric estimates of the minimum mean squared error \({L}^{{\ast}} = \mathbf{E}\{{(Y - m(\mathbf{X}))}^{2}\}\) see, e.g., Dudoit and van der Laan [4], Liitiäinen et al. [9–11], Müller and Stadtmüller [12], Neumann [14], Stadtmüller and Tsybakov [15], Müller, Schick and Wefelmeyer [13], and the literature cited there.
Devroye et al. [3] proved that, without any tail or smoothness condition, \(L^{\ast}\) cannot be estimated with a guaranteed rate of convergence. They introduced a modified nearest neighbour cross-validation estimate
where \(Y_{j(i)}\) is the label of the modified first nearest neighbour of \(\mathbf{X}_{i}\) from among \(\mathbf{X}_{1},\ldots,\mathbf{X}_{i-1},\mathbf{X}_{i+1},\ldots,\mathbf{X}_{n}\). If Y and X are bounded, and m is Lipschitz continuous
then for d ≥ 3, they proved that
Liitiäinen et al. [9, 11] introduced another estimate of the minimum mean squared error \(L^{\ast}\) by first and second nearest neighbour cross-validation
where \(Y_{n,i,1}\) and \(Y_{n,i,2}\) are the labels of the first and second nearest neighbours \(\mathbf{X}_{n,i,1}\) and \(\mathbf{X}_{n,i,2}\), respectively, of \(\mathbf{X}_{i}\) from among \(\mathbf{X}_{1},\ldots,\mathbf{X}_{i-1},\mathbf{X}_{i+1},\ldots,\mathbf{X}_{n}\). (In the sequel, assume that ties occur with probability 0 when calculating the first and second nearest neighbours; when X has a density, ties among nearest neighbour distances occur with probability 0.) If Y and X are bounded and m is Lipschitz continuous, then for d ≥ 2 they proved the rate of convergence of the order given in inequality (14.2).
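For orientation, an estimate built from first and second nearest neighbour labels as described above can be written in the form (a reconstruction from the description above, not a verbatim quotation of [9, 11]):
$$\displaystyle{\frac{1} {n}\sum _{i=1}^{n}(Y _{i} - Y _{n,i,1})(Y _{i} - Y _{n,i,2}).}$$
Using two distinct neighbours makes the noise parts of the two factors conditionally uncorrelated, which removes the noise bias that a single squared nearest neighbour difference would carry.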
In this chapter we introduce a non-recursive and a recursive estimator of the minimum mean squared error \(L^{\ast}\) and prove their distribution-free strong consistency. Under some mild conditions on the regression function m and on the distribution of (X, Y), we bound the rate of convergence of the non-recursive estimate.
2 Strong Universal Consistency
One can derive a new and simple estimator of \(L^{\ast}\) by considering the identity
$$\displaystyle{{L}^{{\ast}} = \mathbf{E}\{{Y }^{2}\} -\mathbf{E}\{m{(\mathbf{X})}^{2}\}.}$$
Obviously, E{Y 2} can be estimated by \(\frac{1} {n}\sum _{i=1}^{n}Y _{ i}^{2}\), while we estimate the term E{m(X)2} by \(\frac{1} {n}\sum _{i=1}^{n}Y _{ i}Y _{n,i,1}\). Thus we estimate L ∗ by
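A minimal computational sketch of this estimate, assuming it is formed exactly as the difference of the two sample averages just described (all names below are illustrative, not from the paper):

```python
import numpy as np

def nn_cv_estimate(X, Y):
    """First nearest neighbour cross-validation estimate of L* = E{(Y - m(X))^2}:
    (1/n) sum_i Y_i^2 estimates E{Y^2}, and (1/n) sum_i Y_i * Y_{n,i,1} estimates
    E{m(X)^2}, where Y_{n,i,1} is the label of the first nearest neighbour of X_i
    among the remaining sample points."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                             # never pick the point itself
    nn = d2.argmin(axis=1)                                   # index of the first nearest neighbour
    return np.mean(Y ** 2) - np.mean(Y * Y[nn])

# toy check: m(x) = x_1 + x_2 with noise variance 0.25, so L* = 0.25
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
Y = X.sum(axis=1) + 0.5 * rng.standard_normal(1000)
print(nn_cv_estimate(X, Y))   # close to 0.25 for large n
```

The O(n²) distance matrix is used only for transparency; a k-d tree query would return the same neighbour indices for larger samples.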
Theorem 14.1.
Assume that ties occur with probability 0. If |Y | is bounded then
If E {Y 2 } < ∞ then
Proof.
This theorem says that, for bounded | Y | , the estimate \(\tilde{L}_{n}\) is strongly consistent, while the estimate \(\bar{L}_{n}\) is strongly universally consistent. The theorem is an easy consequence of Ferrario and Walk [6] (Theorems 2.1 and 2.5), who proved that, for bounded Y,
a.s., and moreover, under the only condition E{Y 2} < ∞,
a.s. We simply use the decomposition
Then, as in the proof of Theorem 2.1 in Ferrario and Walk [6], on the basis of (21)–(25) in [9], one can show that, for bounded Y,
a.s. and
a.s. Similarly, as in the proof of Theorem 2.5 in [6], for E{Y 2} < ∞, one can show that
a.s. and
a.s. Now the statements of the theorem follow from (14.3), (14.5), and (14.6), and from (14.4), (14.7) and (14.8), respectively. □
Next we consider a recursive estimate
where \(Y_{1,1,1} := 0\). It is really recursive since
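A sketch of the recursive computation, under the assumption (suggested by the proof below, where the nearest neighbour of \(\mathbf{X}_{i}\) is taken from \(\mathbf{X}_{1},\ldots,\mathbf{X}_{i-1}\)) that the n-th increment is \(Y_{n}^{2} - Y_{n}\) times the label of the nearest earlier point; the class and variable names are illustrative:

```python
import numpy as np

class RecursiveNNEstimate:
    """Online sketch of a recursive estimate of L*: the i-th term is
    Y_i^2 - Y_i * (label of the 1-NN of X_i among X_1,...,X_{i-1}),
    with the neighbour label taken as 0 for the very first point."""

    def __init__(self):
        self.X, self.Y = [], []
        self.n = 0
        self.estimate = 0.0

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        if self.n == 0:
            nn_label = 0.0                       # convention for the first point (Y_{1,1,1} := 0)
        else:
            d2 = ((np.asarray(self.X) - x) ** 2).sum(axis=1)
            nn_label = self.Y[int(d2.argmin())]  # label of the nearest earlier point
        self.n += 1
        term = y * (y - nn_label)
        # recursion: n * L_n = (n - 1) * L_{n-1} + (Y_n^2 - Y_n * Y_{n-1,n,1})
        self.estimate += (term - self.estimate) / self.n
        self.X.append(x)
        self.Y.append(float(y))
        return self.estimate
```

Past terms never change when a new observation arrives, which is the sense in which the estimate is recursive.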
Theorem 14.2.
Assume that ties occur with probability 0. If E {Y 2 } < ∞ then
Proof.
We have to show that
For a > 0, introduce the truncation function
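The truncation at level a is assumed here to be of the usual form (a reconstruction, not the original display):
$$\displaystyle{T_{a}(y) = \left \{\begin{array}{ll} y &\mbox{ if }\vert y\vert \leq a,\\ a\,\mathrm{sign}(y)&\mbox{ otherwise,}\end{array} \right.}$$
so that, in particular, \(T_{\sqrt{i}}(Y ) = Y\) for all sufficiently large i when Y is bounded, a fact used at the end of the proof.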
As in the proof of Theorem 2.5 in Ferrario and Walk [6], one can check that, in order to show (14.9), it suffices to prove that
a.s. Let \(\mathcal{F}_{i-1}\) be the σ-algebra generated by \((\mathbf{X}_{1},Y _{1}),\ldots,(\mathbf{X}_{i-1},Y _{i-1})\). Introduce the decomposition
where
and
I n is an average of martingale differences such that the a.s. convergence
can be derived from the Chow [1] theorem if
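Chow's strong law for martingale differences is presumably invoked in its standard square-summability form (stated here for orientation; the indexing in the original display may differ): for a martingale difference sequence \(Z_{i}\) with respect to \(\mathcal{F}_{i}\),
$$\displaystyle{\sum _{i=1}^{\infty }\frac{\mathbf{E}\{Z_{i}^{2}\}} {{i}^{2}} < \infty \quad \mbox{ implies }\quad \frac{1} {n}\sum _{i=1}^{n}Z_{i} \rightarrow 0\ \mbox{ a.s.}}$$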
We have that
Because of \(\mathbf{E}\{Y _{1}^{2}\} < \infty \),
Recall now the following useful lemma.
Lemma 14.1.
(Györfi et al. [7], Corollary 6.1) Under the assumption that ties occur with probability 0,
a.s., where I denotes the indicator and γ d < ∞ depends only on d.
Lemma 14.1 implies that
Therefore
and so (14.11) is verified, which implies (14.10). Concerning the term J n , the derivations below are based on the fact that the ordinary 1-NN regression estimate is not universally consistent; however, it is strongly Cesàro convergent in the weak topology, and for noiseless observations (\(Y _{i} = m(\mathbf{X}_{i})\)) it is strongly convergent in L 2. Introduce the notation
Let \(\mathbf{X}_{i-1,1}(\mathbf{x})\) denote the 1-NN (first nearest neighbour) of x from among \(\{\mathbf{X}_{1},\ldots,\mathbf{X}_{i-1}\}\) and \(Y_{i-1,1}(\mathbf{x})\) denote the corresponding label (\(\mathbf{x} \in {\mathbf{R}}^{d}\), i ≥ 2); then
The representation
holds, where \(Y_{0,1}(\mathbf{x}) := 0\). It remains to show that
Before proving (14.12) we state two lemmas. Let μ denote the distribution of X.
Lemma 14.2.
If E {Y 2 } < ∞ then
Proof.
The proof is in the spirit of the proof of Theorem 4.1 and Problems 4.5 and 6.3 in Györfi et al. [7]. □
The following lemma is a reformulation of a classic deterministic Tauberian theorem of Landau [8] in summability theory. For a proof and further references, see Lemma 1 in Walk [16].
Lemma 14.3.
If the sequence a n, \(n = 1,2,\ldots\) of real numbers is bounded from below and satisfies
then
Proof of (14.12).
It suffices to show
In fact, we notice that for each α > 0
If we can show
for some constant c, then this together with \(\int _{{R}^{d}}\mid m_{i}(\mathbf{x}) - m(\mathbf{x}){\mid }^{2}\mu (d\mathbf{x}) \rightarrow 0\) implies
Letting α → 0 then yields that the left-hand side equals 0 a.s. This, together with (14.13), implies (14.12). Therefore, to complete the proof it remains to show (14.14) and (14.13). In the first part we show (14.14). Set \(r(\mathbf{x}):= \mathbf{E}\{{Y }^{2}\vert \mathbf{X} = \mathbf{x}\}\) and \(r_{i}(\mathbf{x}):= \mathbf{E}\{T_{i}({Y }^{2})\vert \mathbf{X} = \mathbf{x}\}\). In order to get (14.14) it is enough to show
where \(r_{1}(\mathbf{X}_{0,1}(\mathbf{x})):= 0,\) and
The latter follows from
a.s. (where the first equality holds by Lemma 14.2), which further yields that the sequence
\(n = 1,2,\ldots\), is a.s. bounded from below. In order to get (14.15) and therefore (14.14) by Lemma 14.3, it suffices to show
We now show (14.16). To this end, set \(A_{i,j}:= \left \{\mathbf{x};\ \mathbf{X}_{i-1,1}(\mathbf{x}) = \mathbf{X}_{j}\right \}.\) We note
where the (n − 1) summands in brackets are orthogonal, because \(\mathbf{E}\{T_{i}(Y _{j}^{2}) - r_{i}(\mathbf{X}_{j})\mid \mathbf{X}_{1},\ldots,\mathbf{X}_{n-1},Y _{j}\prime\} = 0\) for all i and all j′ ≠ j (\(j,j\prime \in \{ 1,\ldots,n - 1\}\)). Thus (14.16) is equivalent to
Let the cones \(C_{1},\ldots,C_{\gamma _{d}}\), with apex at the origin and angle \(\frac{\pi }{3}\), cover \({\mathbf{R}}^{d}\), and let \(B_{i,j,l}\) be the subset of \(C_{j,l}:= \mathbf{X}_{j} + C_{l}\ (j = 1,\ldots,i - 1;\ l = 1,\ldots,\gamma _{d})\) consisting of all x that are closer to \(\mathbf{X}_{j}\) than the 1-NN of \(\mathbf{X}_{j}\) in \(\{\mathbf{X}_{1},\ldots,\mathbf{X}_{j-1},\mathbf{X}_{j+1},\ldots,\mathbf{X}_{i-1}\} \cap C_{j,l}.\) For j ≤ i − 1, a covering result of Devroye et al. [2] (see also pp. 489 and 490 in Györfi et al. [7]) holds as follows:
It suffices to show, for each \(l \in \{ 1,\ldots,\gamma _{d}\}\),
We have that
According to [2] and pp. 489 and 490 in [7], one has that \(\mathbf{P}\{\mu (B_{i,j,l}) > \sqrt{p}\}\) equals the probability that a \(\mathit{Binom}(i - 2,\sqrt{p})\)-distributed random variable takes the value 0, i.e., \({(1 -\sqrt{p})}^{i-2}\) (0 < p < 1). Thus,
Thus the left-hand side of (14.17) is bounded by
(because of E{Y 2} < ∞), where we used the fact that \(\frac{1} {n}\sum _{j=1}^{n-1}{\left (\ln \frac{n} {j} \right )}^{2} \rightarrow \int _{ 0}^{1}{\left (\ln \frac{1} {t} \right )}^{2}dt =\int _{ 0}^{1}{(\ln t)}^{2}dt < \infty.\) Thus (14.17), and therefore (14.14), is proved. In the second part it remains to show (14.13). In order to get it, according to the proof of Lemma 23.3 in Györfi et al. [7], it suffices to show
for some constant \(c^{\ast}\), and to show (14.13) for bounded Y. We first prove (14.18). Notice that
From \(\int m{(\mathbf{x})}^{2}\mu (d\mathbf{x}) \leq \mathbf{E}\{{Y }^{2}\}\) and from (14.14) we obtain (14.18), with \({c}^{{\ast}} = \frac{1} {2} + \frac{1} {2}c\). By the boundedness of Y, we have \(T_{\sqrt{i}}(Y ) = Y\) for all sufficiently large i. Therefore, and because of Lemma 14.2, it suffices to show
where \(m(\mathbf{X}_{0,1}(\mathbf{x})) := 0\). By boundedness, because of Lemma 14.3 it is enough to show
Noticing
we obtain, with suitable constants c′ and c″, that the left-hand side of (14.19) equals
the latter is handled as in the proof of (14.14). Thus (14.13) is proved for bounded Y. Therefore (14.12), and thus the assertion, has been verified. □
3 Rate of Convergence
Next we bound the rate of convergence:
Theorem 14.3.
Assume that Y and X are bounded (|Y | < L, \(\|\mathbf{X}\| < K\)), that m is Lipschitz continuous, and that ties occur with probability 0. In addition, suppose that
- (i) μ has a Lipschitz continuous density f,
- (ii) for any x from the support of μ and 0 < r < 2K, $$\displaystyle{\mu (S_{\mathbf{x},r}) \geq \gamma {r}^{d},}$$ with γ > 0.
Then for d ≥ 2, we have that
Proof.
Apply the decomposition
For the variance term \(\mathbf{Var}(\tilde{L}_{n})\), introduce the notation
For bounded Y ( | Y | ≤ L), we show that
from which we get that
and thus,
In the same way as in Liitiäinen et al. [9], we show (14.20) using the Efron–Stein inequality [5]. Replacement of \((\mathbf{X}_{j},Y _{j})\) by \((\mathbf{X}_{j}\prime,Y _{j}\prime)\) for fixed \(j \in \{ 1,\ldots,n\}\) (where \((\mathbf{X}_{1},Y _{1}),\ldots,(\mathbf{X}_{n},Y _{n}),(\mathbf{X}_{1}\prime,Y _{1}\prime),\ldots,(\mathbf{X}_{n}\prime,Y _{n}\prime)\) are independent and identically distributed) leads to the estimator
According to the Efron–Stein inequality we have that
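For reference, the Efron–Stein inequality in its standard form states that, for independent random elements \(Z_{1},\ldots,Z_{n}\), independent copies \(Z_{1}\prime,\ldots,Z_{n}\prime\), and a square-integrable function f,
$$\displaystyle{\mathbf{Var}\left (f(Z_{1},\ldots,Z_{n})\right ) \leq \frac{1} {2}\sum _{j=1}^{n}\mathbf{E}\left \{{\left (f(Z_{1},\ldots,Z_{n}) - f(Z_{1},\ldots,Z_{j}\prime,\ldots,Z_{n})\right )}^{2}\right \};}$$
presumably it is applied here with \(Z_{j} = (\mathbf{X}_{j},Y_{j})\), and by exchangeability of the sample it suffices to bound \(\mathbf{E}\{{(R_{n} - R_{n,1})}^{2}\}\) for a single replacement, as is done next.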
Evaluate the difference \(R_{n} - R_{n,1}\):
One can check that \(\vert Y _{1}Y _{n,1,1} - Y \prime_{1}Y \prime_{n,1,1}\vert \leq 2{L}^{2}\). Introduce the following notations. Let n[i] be the index of the first nearest neighbour of X i from the set \(\{\mathbf{X}_{1},\mathbf{X}_{2},\ldots,\mathbf{X}_{n}\}\setminus \{\mathbf{X}_{i}\}\). Similarly, let n′[i] be the index of the first nearest neighbour of X i from the set \(\{\mathbf{X}\prime_{1},\mathbf{X}_{2},\ldots,\mathbf{X}_{n}\}\setminus \{\mathbf{X}_{i}\}\). For fixed i ≠ 1, notice
Thus
a.s., where in the last step we applied Lemma 14.1. Summarizing these bounds we get that
a.s., and the proof of (14.20) is complete. For the bias term \(\mathbf{E}\{\tilde{L}_{n}\} - {L}^{{\ast}}\), notice that
Because of
the Lipschitz condition (14.1) implies that
where C is the Lipschitz constant in (14.1). For d ≥ 3, Lemma 6.4 in Györfi et al. [7], and for d ≥ 2, Theorem 3.2 in Liitiäinen et al. [11] say that
Therefore
and so we have to prove that
In order to show (14.21), let us calculate the density \(f_{n}\) of \((\mathbf{X}_{1},\mathbf{X}_{n,1,1})\) with respect to μ × μ. We have that
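Presumably \(f_{n}\) follows from the standard nearest neighbour argument (the display below is a reconstruction under the no-ties assumption, not a quotation): \(\mathbf{X}_{j}\) is the first nearest neighbour of \(\mathbf{X}_{1}\) exactly when none of the other n − 2 sample points falls into the ball \(S_{\mathbf{X}_{1},\Vert \mathbf{X}_{1}-\mathbf{X}_{j}\Vert }\), which suggests
$$\displaystyle{f_{n}(\mathbf{x}_{1},\mathbf{x}_{2}) = (n - 1){\left (1 -\mu (S_{\mathbf{x}_{1},\Vert \mathbf{x}_{1}-\mathbf{x}_{2}\Vert })\right )}^{n-2}}$$
with respect to μ × μ, the factor n − 1 counting the possible choices of the neighbour; the exponential bounds in terms of \({e}^{-(n-2)\mu (S_{\mathbf{X}_{1},\Vert \mathbf{X}_{1}-\mathbf{X}_{2}\Vert })}\) used below are consistent with this form.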
Therefore
This implies that
and
Thus,
and interchanging X 1 and X 2 we get that
and so
m satisfies the Lipschitz condition (14.1). Therefore
and so (14.22) implies that
For any 0 < a < b < 1, we have the inequality
Therefore
If \(c_{d}:= \mathit{Vol}(S_{\mathbf{0},1})\) then condition (i) implies that
Because of condition (ii), both \({e}^{-(n-2)\mu (S_{\mathbf{X}_{1},\Vert \mathbf{X}_{1}-\mathbf{X}_{2}\Vert })}\) and \({e}^{-(n-2)\mu (S_{\mathbf{X}_{2},\Vert \mathbf{X}_{1}-\mathbf{X}_{2}\Vert })}\) are upper bounded by \({e}^{-(n-1)\gamma \Vert \mathbf{X}_{1}-\mathbf{X}{_{2}\Vert }^{d} }\). Therefore
Note that the random variable \(R:=\| \mathbf{X}_{1} -\mathbf{X}_{2}\|\) has a density on [0, 2K] bounded above by \(c_{11}{r}^{d-1}\). Therefore
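The concluding bound presumably rests on an integral estimate of the following generic type (given for orientation; the exponent s stands for the power of R produced by the preceding displays and is not specified here):
$$\displaystyle{\mathbf{E}\left \{{R}^{s}{e}^{-(n-1)\gamma {R}^{d} }\right \} \leq \int _{0}^{2K}{r}^{s}{e}^{-(n-1)\gamma {r}^{d} }c_{11}{r}^{d-1}\,dr = O\left ({n}^{-(s+d)/d}\right ),}$$
obtained by the substitution \(u = (n - 1)\gamma {r}^{d}\) and bounding the resulting integral by a Gamma integral.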
□
References
Chow, Y.S.: Local convergence of martingales and the law of large numbers. Ann. Math. Stat. 36, 552–558 (1965)
Devroye, L., Györfi, L., Krzyżak, A., Lugosi, G.: On the strong universal consistency of nearest neighbor regression function estimation. Ann. Stat. 22, 1371–1385 (1994)
Devroye, L., Schäfer, D., Györfi, L., Walk, H.: The estimation problem of minimum mean squared error. Stat. Decis. 21, 15–28 (2003)
Dudoit, S., van der Laan, M.: Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol. 2, 131–154 (2005)
Efron, B., Stein, C.: The jackknife estimate of variance. Ann. Stat. 9, 586–596 (1981)
Ferrario, P.G., Walk, H.: Nonparametric partitioning estimation of residual and local variance based on first and second nearest neighbors. J. Nonparametric Stat. 24, 1019–1039 (2012)
Györfi, L., Kohler, M., Krzyżak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, New York (2002)
Landau, E.: Über die Bedeutung einiger neuer Grenzwertsätze der Herren Hardy und Axer. Prace mat.-fiz. 21, 91–177 (1910)
Liitiäinen, E., Corona, F., Lendasse, A.: On nonparametric residual variance estimation. Neural Process. Lett. 28, 155–167 (2008)
Liitiäinen, E., Verleysen, M., Corona, F., Lendasse, A.: Residual variance estimation in machine learning. Neurocomputing 72, 3692–3703 (2009)
Liitiäinen, E., Corona, F., Lendasse, A.: Residual variance estimation using a nearest neighbor statistic. J. Multivar. Anal. 101, 811–823 (2010)
Müller, H.G., Stadtmüller, U.: Estimation of heteroscedasticity in regression analysis. Ann. Stat. 15, 610–625 (1987)
Müller, U., Schick, A., Wefelmeyer, W.: Estimating the error variance in nonparametric regression by a covariate-matched U-statistic. Statistics 37, 179–188 (2003)
Neumann, M.H.: Fully data-driven nonparametric variance estimators. Statistics 25, 189–212 (1994)
Stadtmüller, U., Tsybakov, A.: Nonparametric recursive variance estimation. Statistics 27, 55–63 (1995)
Walk, H.: Strong laws of large numbers by elementary Tauberian arguments. Monatsh. Math. 144, 329–346 (2005)
Acknowledgements
This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013).