Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors

doi:10.1016/j.neunet.2010.11.003

Neural Networks

Volume 24, Issue 3, April 2011, Pages 273-279

https://doi.org/10.1016/j.neunet.2010.11.003 Get rights and content

Abstract

Estimation of a regression function from data which consists of an independent and identically distributed sample of the underlying distribution with additional measurement errors in the independent variables is considered. It is allowed that the measurement errors are not independent and have a nonzero mean. It is shown that the rate of convergence of suitably defined least squares neural network estimates applied to this data is similar to the rate of convergence of least squares neural network estimates applied to an independent and identically distributed sample of the underlying distribution as long as the measurement errors are small.

Introduction

Let $(X, Y), (X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots$ be independent identically distributed $R^{d} \times R$ —valued random vectors with $E Y^{2} < \infty$ . In regression analysis we want to estimate $Y$ after having observed $X$ , i.e., we want to determine a function $f$ with $f (X)$ “close” to $Y$ . If “closeness” is measured by the mean squared error, then one wants to find a function $f^{*}$ minimizing the so-called $L_{2}$ -risk $E {{| f^{*} (X) - Y |}^{2}}$ , i.e., $f^{*}$ should satisfy $E {{| f^{*} (X) - Y |}^{2}} = min_{f} E {{| f (X) - Y |}^{2}} .$ Let $m (x) ≔ E {Y | X = x}$ be the regression function. The well-known relation which holds for each measurable function $f$ $E {{| f (X) - Y |}^{2}} = E {{| m (X) - Y |}^{2}} + \int {| f (x) - m (x) |}^{2} P_{X} (d x)$ implies that $m$ is the solution of the minimization problem (1), $E {{| m (X) - Y |}^{2}}$ is the minimum of (2) and for an arbitrary $f$ , the $L_{2}$ error $\int {| f (x) - m (x) |}^{2} P_{X} (d x)$ is the difference between $E {{| f (X) - Y |}^{2}}$ and $E {{| m (X) - Y |}^{2}}$ .

In the regression estimation problem the distribution of $(X, Y)$ (and consequently $m$ ) is unknown. Given a sequence $D_{n} = {(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})}$ of independent observations of $(X, Y)$ , the goal is to construct an estimate $m_{n} (x) = m_{n} (x, D_{n})$ of $m (x)$ such that the $L_{2}$ error $\int {| m_{n} (x) - m (x) |}^{2} P_{X} (d x)$ is small. For a general introduction to regression estimation see, e.g., Györfi, Kohler, Krzyżak, and Walk (2002).

Sometimes it is possible to observe data from the underlying distribution only with measurement errors. In this context usually the problem is considered that the independent variable $X$ can be observed only with additional random errors which have mean zero. More precisely, instead of $X_{i}$ one observes $W_{i} = X_{i} + U_{i}$ for some random variables $U_{i}$ which satisfy $E {U_{i} | X_{i}} = 0$ , and the problem is to estimate the regression function from ${(W_{1}, Y_{1}), \dots, (W_{n}, Y_{n})}$ . In the literature in this context often estimates of the distribution of $U_{i}$ are constructed and estimates of the regression function are defined by using the estimated distribution of $U_{i}$ (see, e.g., Caroll, Maca, and Ruppert (1999), Delaigle, Fan, and Carroll (2009), Delaigle and Meister (2007), Fan and Truong (1993), and the references therein).

In this paper we consider a setting, where basically nothing is assumed on the nature of the measurement errors. In particular, the measurement errors do not have to be independent or identically distributed, and they do not need to have expectation zero. The only assumption we are making is that these measurement errors are somehow “small”.

More precisely, we assume that we have given data ${\bar{D}}_{n} = {({\bar{X}}_{1, n}, Y_{1}), \dots, ({\bar{X}}_{n, n}, Y_{n})},$ where the only assumption on the random variables ${\bar{X}}_{1, n}, \dots, {\bar{X}}_{n, n}$ is that the average measurement error $\frac{1}{n} \sum_{i = 1}^{n} {‖ X_{i} - {\bar{X}}_{i, n} ‖}_{2}$ is small, where ${‖ \cdot ‖}_{2}$ denotes the Euclidean norm. In particular, ${\bar{D}}_{n}$ does not need to be independent or identically distributed, and $E {Y_{1} | {\bar{X}}_{1, n} = x}$ does not need to be equal to $m (x) = E {Y | X = x}$ . For notational simplicity we will suppress in the sequel a possible dependency of ${\bar{X}}_{i} = {\bar{X}}_{i, n}$ on the sample size $n$ in our notation.

It is not clear how the $L_{2}$ error of an arbitrary regression estimate is influenced by additional measurement errors. Due to the fact that we assume nothing on the nature of these errors, in contrast to the classical setting described above there is now no chance to get rid of these errors, so these errors will necessarily increase the $L_{2}$ error of the estimate. Intuitively one can expect that measurement errors influence the error of the estimate not much as long as these measurement errors are small. In this article we show that this is indeed true for suitably defined least squares neural network estimates.

The basic idea behind the definition of our estimate is as follows: Since we assume that (3) is small, it is reasonable to estimate the $L_{2}$ risk of a Lipschitz continuous function $f$ by the so-called empirical $L_{2}$ risk $\frac{1}{n} \sum_{i = 1}^{n} {| f ({\bar{X}}_{i}) - Y_{i} |}^{2}$ computed with the aid of the data with measurement error, and to define least squares estimates as if no measurement errors are present by ${\bar{m}}_{n} (\cdot) = arg min_{f \in F_{n}} \frac{1}{n} \sum_{i = 1}^{n} {| f ({\bar{X}}_{i}) - Y_{i} |}^{2}$ for some set $F_{n}$ of Lipschitz continuous functions $f : R^{d} \to R$ . Here $z = arg {min}_{x \in A} G (x)$ is an abbreviation for $z \in A$ and $G (z) = {min}_{x \in A} G (x)$ and we assume for simplicity that the minima in (4) exist, however we do not require them to be unique.

In this article we will use for $F_{n}$ suitably defined sets of neural networks, which is one of the most promising choices for $F_{n}$ in view of performance of the estimate for high-dimensional data (cf., e.g., Barron, 1993). Our main result is that if we restrict the weights of the neural networks such that the resulting functions are Lipschitz continuous with respect to some Lipschitz constant depending on the sample size, then the $L_{2}$ error of the corresponding least squares neural network regression estimates applied to data with additional measurement errors in the independent variables is basically the sum of the usual error bound for such an estimate applied to data without measurement errors and the product of the measurement error (3) and the Lipschitz constant.

The sets of natural, real numbers and $d$ -dimensional real numbers are denoted by $N, R$ and $R^{d}$ , respectively. For $x \in R^{d}$ we denote by ${‖ x ‖}_{2}$ the Euclidian norm of $x$ . The least integer greater than or equal to a real number $x$ will be denoted by $⌈ x ⌉$ . For a function $f : R^{d} \to R$ ${‖ f ‖}_{\infty} = sup_{x \in R^{d}} | f (x) |$ denotes the supremum norm. $I_{A}$ is the indicator function of a set $A$ , and $# (Q)$ is the cardinality of a set $Q$ . For $z \in R$ and $β > 0$ we define $T_{β} z = min {max {z, - β}, β} .$

The definition of the estimate is given in Section 2, the main result is formulated in Section 3. Section 4 contains the proofs.

Section snippets

Definition of the least squares neural network regression estimates

A feedforward neural network with one hidden layer and $k$ hidden neurons is a real-valued function on $R^{d}$ of the form $f (x) = \sum_{i = 1}^{k} c_{i} \cdot σ (a_{i}^{T} x + b_{i}) + c_{0}$ where $σ : R \to [0, 1]$ is called a sigmoidal function and $a_{1}, \dots, a_{k} \in R^{d}, b_{1}, \dots, b_{k}, c_{0}, c_{1}, \dots, c_{k} \in R$ are the parameters that specify the network. For the sigmoidal function $σ$ one often uses so-called squashing functions, i.e. a function which is non-decreasing and satisfies $lim_{x \to - \infty} σ (x) = 0 and lim_{x \to \infty} σ (x) = 1 .$ It is well-known that feedforward neural networks with one hidden

Main results

Our main result is the following theorem.

Theorem 1

Set $β_{n} = c_{1} \cdot log (n)$ for some $c_{1} > 0$ and define the estimate $m_{n}$ as in Section 2. Assume that ${\bar{D}}_{n_{l}}$ is independent of $(X, Y), (X_{n_{l} + 1}, Y_{n_{l} + 1}), \dots, (X_{n}, Y_{n})$ , that $Y$ is sub-Gaussian in the sense that $E {e^{c_{2} Y^{2}}} < \infty$ for some $c_{2} > 0$ and that the regression function is bounded in absolute value by some $0 \leq L \leq β_{n}$ . Then $E \int {| m_{n} (x) - m (x) |}^{2} P_{X} (d x) \leq c_{3} \cdot (α_{n} \cdot log {(n)}^{2} \cdot E {\frac{1}{n} \sum_{i = 1}^{n} {‖ X_{i} - {\bar{X}}_{i, n} ‖}_{2}} + min_{k \in P_{n}} (\frac{k \cdot log {(n)}^{5}}{n} + inf_{f \in F_{k, n}} \int {| f (x) - m (x) |}^{2} P_{X} (d x)))$ for some constant $c_{3} > 0$ .

Remark 1

The sub-Gaussian condition (9) is in

Proofs

The following lemma is an extension of Lemma 1 in Bagirov, Clausen, and Kohler (2009) to data with measurement errors. It is about bounding the $L_{2}$ error of estimates, which are defined by splitting of the sample. Let $n = n_{l} + n_{t}$ , let $Q_{n}$ be a finite set of parameters and assume that for each parameter $h \in Q_{n}$ an estimate $m_{n_{l}}^{(h)} (\cdot) = m_{n_{l}}^{(h)} (\cdot, {\bar{D}}_{n_{l}})$ is given, which depends only on the training data ${\bar{D}}_{n_{l}} = {({\bar{X}}_{1}, Y_{1}), \dots, ({\bar{X}}_{n_{l}}, Y_{n_{l}})}$ , and which is Lipschitz continuous with Lipschitz constant $L_{n}$ . Then we define $m_{n} ($

Acknowledgements

The authors wish to thank three anonymous referees for various helpful comments.

References (23)

K. Funahashi
On the approximate realization of continuous mappings by neural networks
Neural Networks
(1989)
K. Hornik
Some new results on neural network approximation
Neural Networks
(1993)
K. Hornik et al.
Multi-layer feedforward networks are universal approximators
Neural Networks
(1989)
D.F. McCaffrey et al.
Convergence rates for single hidden layer feedforward networks
Neural Networks
(1994)
J. Mielniczuk et al.
Consistency of multilayer perceptron regression estimators
Neural Networks
(1993)
M. Anthony et al.
Neural networks and learning: theoretical foundations
(1999)
A.M. Bagirov et al.
An $L_{2}$ -boosting algorithm for estimation of a regression function
IEEE Transactions on Information Theory
(2009)
Barron, A.R. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th conference on...
A.R. Barron
Complexity regularization with application to artificial neural networks
A.R. Barron
Universal approximation bounds for superpositions of a sigmoidal function
IEEE Transactions on Information Theory
(1993)

H. Bauer

Probability theory

(1996)

Cited by (17)

A novel fractional order model of SARS-CoV-2 and Cholera disease with real data
2023, Journal of Computational and Applied Mathematics
This study presents a novel approach to investigating COVID-19 and Cholera disease. In this situation, a fractional-order model is created to investigate the COVID-19 and Cholera outbreaks in Congo. The existence, uniqueness, positivity, and boundedness of the solution are studied. The equilibrium points and their stability conditions are achieved. Subsequently, the basic reproduction number (the virus transmission coefficient) is calculated that simply refers to the number of people, to whom an infected person can make infected, as $R_{0} = 6.7442389 e - 10$ by using the next generation matrix method. Next, the sensitivity analysis of the parameters is performed according to $R_{0}$ . To determine the values of the parameters in the model, the least squares curve fitting method is utilized. A total of 22 parameter values in the model are estimated by using real Cholera data from Congo. Finally, to find out the dynamic behavior of the system, numerical simulations are presented. The outcome of the study indicates that the severity of the Cholera epidemic cases will decrease with the decrease in cases of COVID-19, through the implementation and follow-up of safety measures that have been taken to reduce COVID-19 cases.
Optimal nonparametric inference via deep neural network
2022, Journal of Mathematical Analysis and Applications
Citation Excerpt :
The success of deep networks hinges on their rich expressiveness (see [3], [23], [22], [1], [28], [20] and [31,32]). Recently, deep networks have played an increasingly important role in statistics particularly in nonparametric curve fitting (see [15,11,16,18,24]). Applications of deep networks in other fields such as image processing or pattern recgnition include, to name a few, LeCun et al. [19], Deng et al. [4], Wan et al. [29], Gal and Ghahramani [9], etc.
Deep neural network is a state-of-art method in modern science and technology. Much statistical literature have been devoted to understanding its performance in nonparametric estimation, whereas the results are suboptimal due to a redundant logarithmic sacrifice. In this paper, we show that such log-factors are not necessary. We derive upper bounds for the $L^{2}$ minimax risk in nonparametric estimation. Sufficient conditions on network architectures are provided such that the upper bounds become optimal (without log-sacrifice). Our proof relies on an explicitly constructed network estimator based on tensor product B-splines. We also derive asymptotic distributions for the constructed network and a relating hypothesis testing procedure. The testing procedure is further proved as minimax optimal under suitable network architectures.
Estimates on compressed neural networks regression
2015, Neural Networks
Citation Excerpt :
In this section we discuss how our results relate to other recent studies. Our convergence analysis of regression learning algorithms is based on a similar analysis for regression algorithms by Kohler and Mehnert (2011). There are two differences between our work and that of Kohler and Mehnert.
When the neural element number $n$ of neural networks is larger than the sample size $m$ , the overfitting problem arises since there are more parameters than actual data (more variable than constraints). In order to overcome the overfitting problem, we propose to reduce the number of neural elements by using compressed projection $A$ which does not need to satisfy the condition of Restricted Isometric Property (RIP). By applying probability inequalities and approximation properties of the feedforward neural networks (FNNs), we prove that solving the FNNs regression learning algorithm in the compressed domain instead of the original domain reduces the sample error at the price of an increased (but controlled) approximation error, where the covering number theory is used to estimate the excess error, and an upper bound of the excess error is given.
Analysis of convergence performance of neural networks ranking algorithm
2012, Neural Networks
Citation Excerpt :
There are two differences between our work and that of Kohler and Mehnert. The first difference in our generalization analysis of ranking learning algorithms as compared with the problems of regression in Kohler and Mehnert (2011) is the performance which is measured on pairs of samples in ranking problem, rather than on individual samples for regression estimate. This means that, unlike the empirical error in regression estimate, the empirical error in ranking problem cannot be expressed as a sum of independent random variables.
The ranking problem is to learn a real-valued function which gives rise to a ranking over an instance space, which has gained much attention in machine learning in recent years. This article gives analysis of the convergence performance of neural networks ranking algorithm by means of the given samples and approximation property of neural networks. The upper bounds of convergence rate provided by our results can be considerably tight and independent of the dimension of input space when the target function satisfies some smooth condition. The obtained results imply that neural networks are able to adapt to ranking function in the instance space. Hence the obtained results are able to circumvent the curse of dimensionality on some smooth condition.
Performance evaluation of multilayer perceptrons for discriminating and quantifying multiple kinds of odors with an electronic nose
2012, Neural Networks
Citation Excerpt :
An advanced electronic nose will have to work well with a wide variety of odors over large concentration ranges, as a chromatography instrument does, in order to implement complicated discriminative and quantitative tasks. These tough conditions are a challenge for neural networks (Cho, Katahira, Okanoya, & Okada, 2011; Jaiyen, Lursinsap, & Phimoltares, 2010; Jeong & Lee, 2012; Kohler & Mehnert, 2011; Razavi & Tolson, 2011; Rubio, Angelov, & Pacheco, 2011; Trenn, 2008; Wilamowski & Hao, 2010). In order to simultaneously discriminate and quantify many kinds of odors using neural networks, the divide-and-conquer strategy is brought to mind as expected (Horner & Hierold, 1991; Huang & Leung, 2007; Llobet et al., 1997; Orts et al., 1999).
This paper studies several types and arrangements of perceptron modules to discriminate and quantify multiple odors with an electronic nose. We evaluate the following types of multilayer perceptron. ( $A$ ) A single multi-output (SMO) perceptron both for discrimination and for quantification. ( $B$ ) An SMO perceptron for discrimination followed by multiple multi-output (MMO) perceptrons for quantification. ( $C$ ) An SMO perceptron for discrimination followed by multiple single-output (MSO) perceptrons for quantification. ( $D$ ) MSO perceptrons for discrimination followed by MSO perceptrons for quantification, called the MSO-MSO perceptron model, under the following conditions: ( $D$ 1) using a simple one-against-all (OAA) decomposition method; ( $D$ 2) adopting a simple OAA decomposition method and virtual balance step; and ( $D$ 3) employing a local OAA decomposition method, virtual balance step and local generalization strategy all together. The experimental results for 12 kinds of volatile organic compounds at 85 concentration levels in the training set and 155 concentration levels in the test set show that the MSO-MSO perceptron model with the D3 learning procedure is the most effective of those tested for discrimination and quantification of many kinds of odors.
Nonparametric regression on low-dimensional manifolds using deep ReLU networks: function approximation and statistical recovery
2022, Information and Inference

View all citing articles on Scopus

View full text

Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors

Abstract

Introduction

Section snippets

Definition of the least squares neural network regression estimates

Main results

Proofs

Acknowledgements

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural networks and learning: theoretical foundations

An L2-boosting algorithm for estimation of a regression function

IEEE Transactions on Information Theory

Complexity regularization with application to artificial neural networks

Universal approximation bounds for superpositions of a sigmoidal function

IEEE Transactions on Information Theory

Probability theory

An $L_{2}$ -boosting algorithm for estimation of a regression function