
Neural Networks

Volume 21, Issue 1, January 2008, Pages 48-58

Relation between weight size and degree of over-fitting in neural network regression

https://doi.org/10.1016/j.neunet.2007.11.001

Abstract

This paper investigates the relation between over-fitting and weight size in neural network regression. The over-fitting of a network to Gaussian noise is discussed. Using re-parametrization, a network function is represented as a bounded function g multiplied by a coefficient c. The restriction considered here keeps the squared sum of the outputs of g at the given inputs bounded away from zero by a positive constant δn; this restricts the weight size of the network and enables a probabilistic upper bound on the degree of over-fitting to be derived. The analysis reveals that the order of the probabilistic upper bound can change depending on δn. By applying the bound to the over-fitting behavior of a single Gaussian unit, it is shown that the probability of obtaining an extremely small value of the width parameter in training is close to one when the sample size is large.

Introduction

Over-fitting is a problem of learning in layered neural networks such as multi-layer perceptrons and radial basis functions. It is empirically known that the problem is particularly serious when the size of the network is large. When a network is trained with noisy data, it may have a very small training error that is caused by fitting the noise rather than the true function underlying the data. In such situations, the generalization error tends to be larger than its optimal level because the trained network may deviate from the true function. One of the central subjects in the study of learning is to clarify how over-fitting affects generalization performance.

Methods for estimating the generalization error of layered neural networks have been extensively studied (Akaike, 1973, Amari and Murata, 1993, Anthony and Bartlett, 1999, Murata et al., 1994, Vapnik, 1998). One of the important aims of these studies is model selection such as the choice of a network size. When selecting a model, the generalization error should be estimated based on the training error because the generalization error is usually unknown in real world applications. To construct such an estimate, the difference between the generalization error and the training error, which works as a complexity penalty in the criterion for model selection (Akaike, 1973, Murata et al., 1994), should be evaluated. The difference contains two error sources that originate from over-fitting; one is the estimation error, which is the increase in the generalization error from its optimal level, and the other is the degree of over-fitting, which is the decrease in the training error due to noise fitting.
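To fix ideas, this decomposition can be written out as follows; the symbols L, Ln, f̂n and f* are introduced only for this illustration and are not the paper's own notation.

% Illustrative identity: L = generalization error, L_n = training error,
% \hat{f}_n = trained network, f^* = best network in the assumed class.
\[
  L(\hat{f}_n) - L_n(\hat{f}_n)
  = \underbrace{\bigl(L(\hat{f}_n) - L(f^{*})\bigr)}_{\text{estimation error}}
  + \bigl(L(f^{*}) - L_n(f^{*})\bigr)
  + \underbrace{\bigl(L_n(f^{*}) - L_n(\hat{f}_n)\bigr)}_{\text{degree of over-fitting}}
\]
% The middle term is a zero-mean fluctuation at the fixed function f^*, while the
% first and last terms are the two over-fitting-related error sources named above.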

This paper discusses the degree of over-fitting for layered neural networks applied to statistical regression problems motivated by the empirical observation that over-fitting is often observed in a network with high curvature in its graph. Since such high curvatures are caused by weights near the boundary of or at infinity in the weight space, the elucidation of the relations between over-fitting and restrictions on the weights is an interesting problem. This paper shows that certain restrictions in the weight space of layered neural networks yield a variety of probabilistic upper bounds for the degree of over-fitting.

One of the common approaches to the analysis of the degree of over-fitting is to use the standard asymptotic theory (Akaike, 1973, Amari and Murata, 1993, Murata et al., 1994). By assuming certain regularity conditions, the theory indicates that the degree of over-fitting and the estimation error are of the order O(1/n) in probability, where n is the number of training data (Akaike, 1973, Amari and Murata, 1993, Murata et al., 1994).
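As a concrete special case of this O(1/n) behavior, consider ordinary least squares with a correctly specified linear model; the following is a standard textbook calculation given only for illustration and is not a result taken from the paper.

% Ordinary least squares with p parameters, n samples and i.i.d. Gaussian noise of
% variance \sigma^2: the residual sum of squares satisfies E[RSS] = (n - p)\sigma^2, hence
\[
  \mathbb{E}\!\left[\tfrac{1}{n}\,\mathrm{RSS}\right] = \sigma^{2} - \frac{p\,\sigma^{2}}{n},
\]
% i.e. fitting the noise pulls the training error below the noise level \sigma^2 by
% p\sigma^2/n, which is the O(1/n) order predicted under the regularity conditions.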

Recently, several studies have revealed that layered neural networks can have properties different from those shown by the standard theory (Fukumizu, 2003, Hagiwara, 2002, Hayasaka et al., 2004). When the true function, or the underlying function, is realizable by a network with fewer hidden units than the assumed network, the regularity conditions of the standard theory are not satisfied due to the unidentifiability of the weights representing the true function (Amari and Ozeki, 2001, Hagiwara et al., 1993, White, 1989). The results in such over-realizable cases show that the degree of over-fitting of layered neural networks is not bounded by O(1/n) in probability (Fukumizu, 2003, Hagiwara, 2002, Hayasaka et al., 2004); the degree of over-fitting has a probabilistic lower bound of the order O(log n/n) for a multi-layer perceptron with a wide class of noise models (Fukumizu, 2003). In the case of one-dimensional regression by a network with one hidden unit of the step function, the degree of over-fitting to Gaussian noise has a probabilistic lower bound of the order O(log log n/n) (Hayasaka et al., 2004), while it is of the order O(log n/n) when two hidden units of the step function are used (Fukumizu, 2003, Hagiwara, 2002). Furthermore, in the case of Gaussian radial basis functions, the degree of over-fitting to Gaussian noise has a probabilistic lower bound of the order O(log n/n), even when only a single Gaussian unit is used (Hagiwara, 2002).

All of the above results concern situations where a network can realize a function with very high curvature by using extreme weights close to zero or infinity. The purpose of this study is to provide an explicit relation between the sizes of the weights and a probabilistic upper bound on the degree of over-fitting. A restriction is imposed on the shape of the functions in the hidden units in order to specify the sizes of the weights. In particular, the restriction depends on the number of training data and typically enlarges the weight space as the data size increases. Different restrictions yield probabilistic upper bounds of various orders, such as O(log n/n) and O(log log n/n). Thus, the O(log n/n) lower bound, obtained in Fukumizu (2003) and Hagiwara (2002) without restricting the weight space, does not necessarily hold if the restriction is loosened slowly as the number of training data increases. This paper shows that an O(log log n/n) upper bound is typical under such a restriction.
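The over-fitting of a single Gaussian unit to pure noise, as studied in Hagiwara (2002), can be reproduced numerically. The sketch below is only an illustration of the phenomenon; the optimizer, the random restarts and the fixed design on [0, 1] are ad hoc choices, not the paper's setting.

# Fit one Gaussian unit  f(x) = a * exp(-(x - b)^2 / (2 s^2))  to pure Gaussian noise
# and record how far the training error drops below the noise level, i.e. an
# empirical degree of over-fitting.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def gaussian_unit(x, a, b, s):
    return a * np.exp(-(x - b) ** 2 / (2 * s ** 2))

def degree_of_overfitting(n, sigma=1.0, restarts=20):
    x = np.linspace(0.0, 1.0, n)            # fixed design on [0, 1]
    y = sigma * rng.standard_normal(n)      # true function is zero: pure noise
    noise_level = np.mean(y ** 2)           # training error of the true function

    def mse(theta):
        a, b, log_s = theta
        return np.mean((y - gaussian_unit(x, a, b, np.exp(log_s))) ** 2)

    best = noise_level
    for _ in range(restarts):               # random restarts: the problem is non-convex
        theta0 = [rng.normal(), rng.uniform(0.0, 1.0), np.log(rng.uniform(1e-3, 0.5))]
        best = min(best, minimize(mse, theta0, method="Nelder-Mead").fun)
    return noise_level - best               # decrease of the training error due to noise fitting

for n in (50, 200, 800):
    print(n, degree_of_overfitting(n))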

This paper also provides theoretical verification of the empirical observation that over-fitting to noisy data is caused by extreme values of the weights of the input units, which result in a high curvature of the graph of the network. In an over-realizable case, a network with sigmoidal or Gaussian units can produce a narrow hump, typically by using extreme weight values. This makes it possible for the network to fit one training point while keeping the rest of the function almost constant. On the other hand, the network can also produce a smoother function with a certain amount of error at every training point. An interesting question is which of these is preferable for achieving the smallest training error. In this paper, a simple example of a network with one Gaussian unit is considered, and the above-mentioned problem is solved by applying the bounds on the degree of over-fitting; it is proved that the input weights of a trained network lie in a small region that shrinks to zero as the sample size goes to infinity. This implies that the former type of fitting is preferred in the over-realizable case. Such a result on the behavior of the trained weights has not been obtained in the standard asymptotic theory or in computational learning theory, which focus on the estimation error rather than the degree of over-fitting.
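To make the comparison concrete, the minimal numerical sketch below contrasts the two kinds of fit for a single Gaussian unit on pure noise; the sample size and parameter values are illustrative assumptions, not the paper's experiment.

# Two ways a single Gaussian unit  f(x) = a * exp(-(x - b)^2 / (2 s^2))  can fit pure noise:
#  (a) a narrow hump: tiny width s, centred on the sample with the largest |y_i|, so that
#      one training point is interpolated and f is almost zero elsewhere;
#  (b) a smooth fit: a very large width, which makes f nearly constant over the inputs.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0.0, 1.0, n)
y = rng.standard_normal(n)                        # true function is zero: pure Gaussian noise

def train_error(a, b, s):
    f = a * np.exp(-(x - b) ** 2 / (2 * s ** 2))
    return np.mean((y - f) ** 2)

k = np.argmax(np.abs(y))                          # training point with the largest |y_i|
narrow = train_error(a=y[k], b=x[k], s=1e-4)      # (a) spike through that single point
smooth = train_error(a=np.mean(y), b=0.5, s=1e3)  # (b) essentially the best constant fit

print("noise level      :", np.mean(y ** 2))
print("narrow-hump error:", narrow)               # reduced by about max_i y_i^2 / n
print("smooth-fit error :", smooth)               # reduced by about (mean of y)^2

Since max_i y_i^2 grows like 2 log n for Gaussian noise while the squared sample mean is of the order 1/n, the narrow hump achieves the larger reduction of the training error, which is consistent with the preference established in this paper for the over-realizable case.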

The estimation error, which is another cause of the difference between the generalization error and the training error, has been intensively investigated using computational learning theory or statistical learning theory (Anthony and Bartlett, 1999, Haussler, 1992, Krzyżak et al., 1996, Vapnik, 1998). While this approach has the advantage of being free from the previously mentioned unidentifiability problem, it is not suitable for a detailed analysis of over-fitting and the trained weights. In this approach, an upper bound is obtained for the estimation error, and the bound is used to evaluate the accuracy of the trained network in terms of generalization capability and to show the uniform convergence of the generalization error of the trained network to its optimal level (Anthony and Bartlett, 1999, Devroye et al., 1996). The approach is also applied to a model selection strategy called structural risk minimization (Vapnik, 1998). In deriving the bound on the generalization error, the main technique is to bound the worst case by taking the supremum of the difference between the generalization error and the training error over all possible networks (Anthony and Bartlett, 1999, Bartlett, 1998, Vapnik, 1998). Although this simplifies the mathematical problem, so that detailed properties of the specific trained weights are not required, it makes it difficult to analyze over-fitting.
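The worst-case technique referred to here can be summarized by the following generic uniform-deviation bound; the form shown (for a loss bounded in [0, 1]) is a standard textbook statement given for illustration, not a bound quoted from the paper.

% For a trained network \hat{f}_n selected from a class F,
%   L(\hat{f}_n) - L_n(\hat{f}_n) \le \sup_{f \in F} ( L(f) - L_n(f) ),
% and for a loss bounded in [0,1] the supremum satisfies, with probability at least 1 - \delta,
\[
  \sup_{f \in F}\bigl|L(f) - L_n(f)\bigr|
  \;\le\; C(F, n) + \sqrt{\frac{\log(1/\delta)}{2n}},
\]
% where C(F, n) is a capacity term (e.g. a VC-dimension, covering-number or Rademacher
% complexity bound for F). Because the bound holds uniformly over F, it gives no information
% about which particular weights the training procedure actually selects, which is why this
% route is ill-suited to the analysis of over-fitting and trained weights.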

This paper is organized as follows. Section 2 details the formulation of neural network regression and the definition of the degree of over-fitting; the re-parametrization of network functions is introduced for this purpose. The main results are presented in Section 3. In that section, a probabilistic upper bound on the degree of over-fitting is first derived, which reflects a given restriction on the weight space. The variation in the probabilistic upper bounds induced by the degree of the restriction is then obtained from this bound. The various bounds allow the behavior of the trained weights under over-fitting to be analyzed. Section 3 also includes an analysis of the trained weights for a Gaussian unit. Conclusions and future work are presented in Section 4.

Section snippets

Neural network regression

In layered neural networks such as three-layer perceptrons and radial basis functions, the output for an input $X \in \mathbb{R}^d$ can generally be written as $f_{w_m}(X) = \sum_{j=1}^{m} a_j \psi_{b_j}(X)$, where $w_m = (a_1, \ldots, a_m, b_1, \ldots, b_m)$ is a network weight vector with $a_j \in \mathbb{R}$ and $b_j \in B$. The parameter set $B$ is specified according to the type of activation function employed in the hidden layer. The weight space of $f_{w_m}$ is denoted by $W_m = \{\mathbb{R} \times B\}^m$. $F_m = \{f_{w_m} \mid w_m \in W_m\}$ is defined and $f_{w_m}$ is written as $f_m$ for simplicity. $|\psi_{b_j}(X)| \le 1$ is assumed for any $X \in \mathbb{R}$
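As a concrete rendering of this definition, the sketch below implements $f_{w_m}$ with Gaussian hidden units; choosing $\psi_{b_j}$ to be a Gaussian, and all variable names, are assumptions made only for illustration, since any activation bounded by 1 in absolute value fits the definition.

# Minimal sketch of  f_{w_m}(X) = sum_{j=1}^{m} a_j * psi_{b_j}(X)  with Gaussian hidden
# units  psi_{(mu, s)}(X) = exp(-||X - mu||^2 / (2 s^2)),  which satisfy |psi| <= 1.
import numpy as np

def network_output(X, weights):
    """X: input of shape (d,); weights: list of m triples (a_j, mu_j, s_j)."""
    out = 0.0
    for a, mu, s in weights:
        psi = np.exp(-np.sum((X - mu) ** 2) / (2.0 * s ** 2))  # bounded by 1 in absolute value
        out += a * psi
    return out

# Example: m = 2 Gaussian units on inputs in R^2.
w = [(1.5, np.array([0.0, 0.0]), 0.5),
     (-0.7, np.array([1.0, 1.0]), 0.3)]
print(network_output(np.array([0.2, -0.1]), w))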

Main theorem

Restrictions are introduced on $F_m$ to control the complexity of the function class and to show how the restrictions affect $D_n(f_m)$. Consider a family of functions $\Gamma$ that consists of functions on $\mathbb{R}^d$ with values in the interval $[-1,1]$. In accordance with the parametrization mentioned in Section 2.2, a class of functions $f_{c,g}(X) = c\,g(X)$ is considered, where $c \in \mathbb{R}$ and $g \in \Gamma$. Note that this class includes the network functions considered in Section 2.2, in which the function $g$ is defined by a linear
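The following sketch illustrates this re-parametrization $f_{c,g}(X) = c\,g(X)$ together with the kind of restriction described in the Abstract and the Conclusions, namely that the squared sum of the outputs of g at the given inputs is bounded away from zero by δn; the particular normalization used to extract c is an assumption for illustration, not the paper's definition.

# Re-parametrize a network function into f = c * g with |g| <= 1, and check a
# restriction of the form  sum_i g(X_i)^2 >= delta_n  at the given inputs.
import numpy as np

def reparametrize(f_values):
    """Given the values f(X_1), ..., f(X_n) at the inputs, return (c, g_values)
    with f = c * g and max_i |g(X_i)| = 1 (one possible choice of the scale c)."""
    c = np.max(np.abs(f_values))
    if c == 0.0:
        return 0.0, np.zeros_like(f_values)
    return c, f_values / c

def satisfies_restriction(g_values, delta_n):
    """Squared sum of the outputs of g at the inputs bounded away from zero by delta_n."""
    return np.sum(g_values ** 2) >= delta_n

# A narrow hump that interpolates a single one of n = 100 inputs gives sum_i g(X_i)^2 = 1,
# so it is excluded by any restriction with delta_n > 1, whereas a smoother g is not.
n = 100
spike_g = np.zeros(n); spike_g[0] = 1.0
smooth_g = 0.5 * np.ones(n)
print(satisfies_restriction(spike_g, delta_n=2.0))   # False
print(satisfies_restriction(smooth_g, delta_n=2.0))  # True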

Conclusions and future work

A probabilistic upper bound for the degree of over-fitting to Gaussian noise is obtained for neural network regression, provided that the function class is restricted so that the complexity of the class is controlled. Using re-parametrization, a network function is represented as a bounded function g multiplied by a coefficient c. The restriction states that the squared sum of the outputs of the bounded function g at the given inputs is bounded away from zero by a positive constant δn that

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions for improvement. This research was partly supported by a Grant-in-Aid for Scientific Research 15700187 from the Ministry of Education, Science, Sports and Culture, Japan.

References (22)

  • Fukumizu, K., & Hagiwara, K. (2003). A general upper bound of likelihood ratio for regression. Research Memorandum,...