Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors
Introduction
Let $(X, Y)$, $(X_1, Y_1)$, $(X_2, Y_2), \dots$ be independent identically distributed $\mathbb{R}^d \times \mathbb{R}$-valued random vectors with $\mathbf{E}\{Y^2\} < \infty$. In regression analysis we want to estimate $Y$ after having observed $X$, i.e., we want to determine a function $f$ with $f(X)$ "close" to $Y$. If "closeness" is measured by the mean squared error, then one wants to find a function $f^*$ minimizing the so-called $L_2$ risk $\mathbf{E}\{|f(X) - Y|^2\}$, i.e., $f^*$ should satisfy
$$\mathbf{E}\{|f^*(X) - Y|^2\} = \min_{f} \mathbf{E}\{|f(X) - Y|^2\}. \quad (1)$$
Let $m(x) = \mathbf{E}\{Y \mid X = x\}$ be the regression function. The well-known relation
$$\mathbf{E}\{|f(X) - Y|^2\} = \mathbf{E}\{|m(X) - Y|^2\} + \int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx), \quad (2)$$
which holds for each measurable function $f$, implies that $m$ is the solution of the minimization problem (1), that $\mathbf{E}\{|m(X) - Y|^2\}$ is the minimal value in (1), and that for an arbitrary $f$ the so-called $L_2$ error $\int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx)$ is the difference between the $L_2$ risk of $f$ and this minimal risk.
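For the reader's convenience, relation (2) follows by conditioning on $X$; the short derivation below is the standard argument and is not specific to the present setting:
$$\begin{aligned}
\mathbf{E}\{|f(X) - Y|^2\}
&= \mathbf{E}\{|f(X) - m(X)|^2\} + 2\,\mathbf{E}\{(f(X) - m(X))(m(X) - Y)\} + \mathbf{E}\{|m(X) - Y|^2\} \\
&= \int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx) + \mathbf{E}\{|m(X) - Y|^2\},
\end{aligned}$$
where the cross term vanishes because $\mathbf{E}\{m(X) - Y \mid X\} = 0$ by the definition of $m$.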
In the regression estimation problem the distribution of $(X, Y)$ (and consequently $m$) is unknown. Given a sequence $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ of independent observations of $(X, Y)$, the goal is to construct an estimate $m_n(\cdot) = m_n(\cdot, \mathcal{D}_n)$ of $m$ such that the $L_2$ error $\int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)$ is small. For a general introduction to regression estimation see, e.g., Györfi, Kohler, Krzyżak, and Walk (2002).
Sometimes it is possible to observe data from the underlying distribution only with measurement errors. In this context usually the problem is considered that the independent variable can be observed only with additional random errors which have mean zero. More precisely, instead of $X_i$ one observes
$$\bar{X}_i = X_i + U_i$$
for some random variables $U_i$ which satisfy $\mathbf{E}\{U_i\} = 0$, and the problem is to estimate the regression function $m$ from the data $(\bar{X}_1, Y_1), \dots, (\bar{X}_n, Y_n)$. In the literature in this context often estimates of the distribution of $X$ are constructed and estimates of the regression function are defined by using this estimated distribution (see, e.g., Carroll, Maca, and Ruppert (1999), Delaigle, Fan, and Carroll (2009), Delaigle and Meister (2007), Fan and Truong (1993), and the references therein).
In this paper we consider a setting where basically nothing is assumed about the nature of the measurement errors. In particular, the measurement errors do not have to be independent or identically distributed, and they do not need to have expectation zero. The only assumption we make is that these measurement errors are somehow "small".
More precisely, we assume that we are given data
$$\bar{\mathcal{D}}_n = \{(\bar{X}_1, Y_1), \dots, (\bar{X}_n, Y_n)\},$$
where the only assumption on the random variables $\bar{X}_1, \dots, \bar{X}_n$ is that the average measurement error
$$\frac{1}{n} \sum_{i=1}^{n} \|\bar{X}_i - X_i\| \quad (3)$$
is small, where $\|\cdot\|$ denotes the Euclidean norm. In particular, $\bar{X}_1, \dots, \bar{X}_n$ do not need to be independent or identically distributed, and the conditional expectation of $\bar{X}_i$ given $X_i$ does not need to be equal to $X_i$. For notational simplicity we will suppress in the sequel a possible dependence of $\bar{X}_i$ on the sample size $n$ in our notation.
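As a simple numerical illustration of quantity (3), the following sketch computes the average Euclidean measurement error for perturbed covariates; the data, the perturbation model and all variable names are hypothetical and serve only as an example.

```python
import numpy as np

# Illustration only: average measurement error (3) for noisy covariates
# X_bar compared with the (unobserved) true covariates X.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.uniform(size=(n, d))                     # true covariates
X_bar = X + 0.01 * rng.standard_normal((n, d))   # observed, perturbed covariates

avg_measurement_error = np.mean(np.linalg.norm(X_bar - X, axis=1))
print(avg_measurement_error)  # small if the perturbation is small
```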
It is not clear how the error of an arbitrary regression estimate is influenced by such additional measurement errors. Since we assume nothing about the nature of these errors, in contrast to the classical setting described above there is no way to get rid of them, so they will necessarily increase the error of the estimate. Intuitively one can expect that measurement errors do not influence the error of the estimate much as long as they are small. In this article we show that this is indeed true for suitably defined least squares neural network estimates.
The basic idea behind the definition of our estimate is as follows: since we assume that (3) is small, it is reasonable to estimate the $L_2$ risk $\mathbf{E}\{|f(X) - Y|^2\}$ of a Lipschitz continuous function $f$ by the so-called empirical $L_2$ risk
$$\frac{1}{n} \sum_{i=1}^{n} |f(\bar{X}_i) - Y_i|^2$$
computed with the aid of the data with measurement errors, and to define least squares estimates as if no measurement errors were present by
$$m_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{i=1}^{n} |f(\bar{X}_i) - Y_i|^2 \quad (4)$$
for some set $\mathcal{F}_n$ of Lipschitz continuous functions. Here $z = \arg\min_{x \in D} F(x)$ is an abbreviation for $z \in D$ and $F(z) = \min_{x \in D} F(x)$, and we assume for simplicity that the minima in (4) exist; however, we do not require them to be unique.
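The empirical $L_2$ risk appearing in (4) is straightforward to compute; a minimal sketch in Python follows, where the function and variable names are ours and chosen only for illustration.

```python
import numpy as np

def empirical_l2_risk(f, X_bar, Y):
    """Empirical L2 risk (1/n) * sum_i |f(X_bar_i) - Y_i|^2, computed on the
    data with measurement errors, as in the least squares criterion (4)."""
    residuals = np.apply_along_axis(f, 1, X_bar) - Y
    return np.mean(residuals ** 2)

# Hypothetical usage with a fixed Lipschitz continuous candidate function.
rng = np.random.default_rng(1)
X_bar = rng.uniform(size=(100, 2))
Y = np.sin(2 * np.pi * X_bar[:, 0]) + 0.1 * rng.standard_normal(100)
print(empirical_l2_risk(lambda x: np.sin(2 * np.pi * x[0]), X_bar, Y))
```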
In this article we will use for $\mathcal{F}_n$ suitably defined sets of neural networks, which are among the most promising choices in view of the performance of the estimate for high-dimensional data (cf., e.g., Barron, 1993). Our main result is that if we restrict the weights of the neural networks such that the resulting functions are Lipschitz continuous with respect to some Lipschitz constant depending on the sample size, then the error of the corresponding least squares neural network regression estimate applied to data with additional measurement errors in the independent variables is essentially the sum of the usual error bound for such an estimate applied to data without measurement errors and the product of the measurement error (3) and the Lipschitz constant.
The sets of natural numbers, real numbers and $d$-dimensional real vectors are denoted by $\mathbb{N}$, $\mathbb{R}$ and $\mathbb{R}^d$, respectively. For $x \in \mathbb{R}^d$ we denote by $\|x\|$ the Euclidean norm of $x$. The least integer greater than or equal to a real number $z$ will be denoted by $\lceil z \rceil$. For a function $f : \mathbb{R}^d \to \mathbb{R}$, $\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$ denotes the supremum norm. $I_A$ is the indicator function of a set $A$, and $|A|$ is the cardinality of a finite set $A$. For $z \in \mathbb{R}$ and $\beta > 0$ we define the truncated value $T_\beta z = \max\{\min\{z, \beta\}, -\beta\}$.
The definition of the estimate is given in Section 2, the main result is formulated in Section 3. Section 4 contains the proofs.
Definition of the least squares neural network regression estimates
A feedforward neural network with one hidden layer and $k$ hidden neurons is a real-valued function on $\mathbb{R}^d$ of the form
$$f(x) = \sum_{j=1}^{k} c_j \cdot \sigma\!\left(\sum_{l=1}^{d} a_{j,l} \cdot x^{(l)} + b_j\right) + c_0,$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is called a sigmoidal function and $a_{j,l}, b_j, c_j \in \mathbb{R}$ are the parameters that specify the network. For the sigmoidal function one often uses so-called squashing functions, i.e., non-decreasing functions $\sigma : \mathbb{R} \to [0, 1]$ which satisfy
$$\lim_{x \to -\infty} \sigma(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} \sigma(x) = 1.$$
It is well-known that feedforward neural networks with one hidden layer can approximate any continuous function on a compact set arbitrarily well, provided the number of hidden neurons is sufficiently large.
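To make the construction concrete, the following Python sketch implements a one-hidden-layer network of the above form and fits it by minimizing the empirical $L_2$ risk (4) on data with noisy covariates. It is an illustration under our own assumptions: the logistic squashing function, the generic optimizer scipy.optimize.minimize, the simple box constraint on the weights (which controls the Lipschitz constant of the fitted network) and all variable names are our choices, not the exact estimate analysed in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    """A squashing function: non-decreasing with limits 0 and 1."""
    return 1.0 / (1.0 + np.exp(-u))

def network(x, params, k, d):
    """One-hidden-layer network f(x) = sum_j c_j * sigma(sum_l a_{j,l} x^(l) + b_j) + c_0."""
    a = params[: k * d].reshape(k, d)        # inner weights a_{j,l}
    b = params[k * d : k * d + k]            # inner biases b_j
    c = params[k * d + k : k * d + 2 * k]    # outer weights c_j
    c0 = params[-1]                          # outer bias c_0
    return sigmoid(x @ a.T + b) @ c + c0

def fit_network(X_bar, Y, k, weight_bound):
    """Least squares fit (4): minimize the empirical L2 risk on the noisy
    covariates X_bar, with every weight restricted to [-weight_bound, weight_bound];
    bounded weights keep the fitted network Lipschitz continuous."""
    n, d = X_bar.shape
    n_params = k * d + 2 * k + 1

    def empirical_risk(params):
        return np.mean((network(X_bar, params, k, d) - Y) ** 2)

    rng = np.random.default_rng(0)
    result = minimize(empirical_risk, rng.standard_normal(n_params),
                      method="L-BFGS-B",
                      bounds=[(-weight_bound, weight_bound)] * n_params)
    return result.x

# Hypothetical usage on simulated data with small measurement errors.
rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 1))                    # true covariates
X_bar = X + 0.01 * rng.standard_normal((200, 1))  # observed noisy covariates
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
params = fit_network(X_bar, Y, k=10, weight_bound=20.0)
```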
Main results
Our main result is the following theorem.
Theorem 1. Define the estimate $m_n$ as in Section 2, with the number of hidden neurons and the Lipschitz constant of the networks chosen in dependence of the sample size $n$. Assume that the error $Y - m(X)$ is independent of $X$, that $Y$ is sub-Gaussian in the sense that
$$\mathbf{E}\{\exp(c_1 \cdot Y^2)\} < \infty \quad (9)$$
for some constant $c_1 > 0$, and that the regression function $m$ is bounded in absolute value by some constant $\beta > 0$. Then the expected $L_2$ error $\mathbf{E} \int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)$ is bounded, for some constant $c_2 > 0$, by $c_2$ times the sum of the error bound valid for the same estimate applied to data without measurement errors and the product of the Lipschitz constant of the networks and the expected average measurement error (3).
Remark 1. The sub-Gaussian condition (9) is in particular satisfied if $Y$ is bounded, or if $Y = m(X) + \epsilon$ for some Gaussian random variable $\epsilon$ independent of $X$ and some bounded regression function $m$.
Proofs
The following lemma is an extension of Lemma 1 in Bagirov, Clausen, and Kohler (2009) to data with measurement errors. It bounds the error of estimates which are defined by splitting of the sample. Let $n_l, n_t \in \mathbb{N}$ with $n_l + n_t = n$, let $\mathcal{P}_n$ be a finite set of parameters, and assume that for each parameter $p \in \mathcal{P}_n$ an estimate $m_{n_l}^{(p)}$ is given, which depends only on the training data $(\bar{X}_1, Y_1), \dots, (\bar{X}_{n_l}, Y_{n_l})$, and which is Lipschitz continuous with Lipschitz constant $L_n$. Then we define the final estimate by choosing the parameter which minimizes the empirical $L_2$ risk on the remaining testing data, i.e.,
$$m_n(\cdot) = m_{n_l}^{(p^*)}(\cdot), \quad \text{where} \quad p^* = \arg\min_{p \in \mathcal{P}_n} \frac{1}{n_t} \sum_{i = n_l + 1}^{n} |m_{n_l}^{(p)}(\bar{X}_i) - Y_i|^2.$$
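The splitting-of-the-sample device is easy to state in code. The sketch below is our own illustration, with hypothetical names and a dictionary-based interface; it selects, among finitely many fitted estimates, the one with the smallest empirical $L_2$ risk on the testing data.

```python
import numpy as np

def split_sample_selection(estimates, X_bar_test, Y_test):
    """Pick the fitted estimate with the smallest empirical L2 risk on the
    testing part of the sample.

    `estimates` maps a parameter value to a fitted prediction function
    (each one fitted on the training part of the sample only)."""
    def test_risk(f):
        predictions = np.array([f(x) for x in X_bar_test])
        return np.mean((predictions - Y_test) ** 2)

    best_param = min(estimates, key=lambda p: test_risk(estimates[p]))
    return best_param, estimates[best_param]

# Hypothetical usage: two candidate estimates, e.g. obtained with different
# numbers of hidden neurons on the training data.
rng = np.random.default_rng(3)
X_bar_test = rng.uniform(size=(50, 1))
Y_test = np.sin(2 * np.pi * X_bar_test[:, 0]) + 0.1 * rng.standard_normal(50)
candidates = {
    "small_network": lambda x: 0.0 * x[0],
    "large_network": lambda x: np.sin(2 * np.pi * x[0]),
}
print(split_sample_selection(candidates, X_bar_test, Y_test)[0])
```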
Acknowledgements
The authors wish to thank three anonymous referees for various helpful comments.
References
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks.
Some new results on neural network approximation. Neural Networks (1993).
Hornik, K., Stinchcombe, M., & White, H. (1989). Multi-layer feedforward networks are universal approximators. Neural Networks.
Convergence rates for single hidden layer feedforward networks. Neural Networks (1994).
Consistency of multilayer perceptron regression estimators. Neural Networks (1993).
Neural networks and learning: theoretical foundations (1999).
Bagirov, A. M., Clausen, C., & Kohler, M. (2009). An $L_2$-boosting algorithm for estimation of a regression function. IEEE Transactions on Information Theory.
Barron, A. R. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th conference on...
Barron, A. R. Complexity regularization with application to artificial neural networks.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory.
Probability theory.