Relation between weight size and degree of over-fitting in neural network regression
Introduction
Over-fitting is a problem in learning with layered neural networks such as multi-layer perceptrons and radial basis function networks. It is empirically known that the problem is particularly serious when the size of the network is large. When a network is trained with noisy data, it may attain a very small training error by fitting the noise rather than the true function underlying the data. In such situations, the generalization error tends to be larger than its optimal level because the trained network may deviate from the true function. One of the central subjects in the study of learning is to clarify how over-fitting affects generalization performance.
Methods for estimating the generalization error of layered neural networks have been extensively studied (Akaike, 1973, Amari and Murata, 1993, Anthony and Bartlett, 1999, Murata et al., 1994, Vapnik, 1998). One of the important aims of these studies is model selection, such as the choice of a network size. When selecting a model, the generalization error should be estimated from the training error because the generalization error is usually unknown in real-world applications. To construct such an estimate, the difference between the generalization error and the training error, which works as a complexity penalty in the criterion for model selection (Akaike, 1973, Murata et al., 1994), should be evaluated. The difference contains two error sources that originate from over-fitting: one is the estimation error, which is the increase in the generalization error from its optimal level, and the other is the degree of over-fitting, which is the decrease in the training error due to noise fitting.
This paper discusses the degree of over-fitting for layered neural networks applied to statistical regression problems, motivated by the empirical observation that over-fitting often occurs in a network with high curvature in its graph. Since such high curvatures are caused by weights near the boundary of the weight space or at infinity, elucidating the relation between over-fitting and restrictions on the weights is an interesting problem. This paper shows that certain restrictions in the weight space of layered neural networks yield a variety of probabilistic upper bounds for the degree of over-fitting.
One of the common approaches to the analysis of the degree of over-fitting is to use the standard asymptotic theory (Akaike, 1973, Amari and Murata, 1993, Murata et al., 1994). By assuming certain regularity conditions, the theory indicates that the degree of over-fitting and the estimation error are of the order 1/n in probability, where n is the number of training data (Akaike, 1973, Amari and Murata, 1993, Murata et al., 1994).
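The regular-case 1/n behavior can be checked with a small simulation. The sketch below (a hypothetical illustration, not an experiment from the paper) fits a linear model with k parameters to pure Gaussian noise; since the expected residual sum of squares is sigma^2 (n - k), the training error falls below the noise level sigma^2 by k sigma^2 / n on average, so the rescaled degree of over-fitting should hover near k.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, k = 1.0, 5          # noise variance and number of parameters (assumed)
trials = 3000

def mean_overfit(n):
    """Average decrease of the training error below sigma^2 when fitting
    a k-parameter linear model to pure noise (true function = 0)."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, k))
        y = np.sqrt(sigma2) * rng.standard_normal(n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        train_err = np.mean((y - X @ beta) ** 2)
        total += sigma2 - train_err       # degree of over-fitting
    return total / trials

for n in (50, 100, 200):
    print(n, mean_overfit(n) * n / sigma2)   # should stay close to k
```

The rescaled quantity n * (sigma^2 - training error) / sigma^2 concentrating near k is exactly the 1/n-order behavior predicted by the standard asymptotic theory in the regular case.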
Recently, several studies have revealed that layered neural networks can have properties different from those shown by the standard theory (Fukumizu, 2003, Hagiwara, 2002, Hayasaka et al., 2004). When the true function, or the underlying function, is realizable by a network with fewer hidden units than the assumed network, the regularity conditions of the standard theory are not satisfied because the weights representing the true function are unidentifiable (Amari and Ozeki, 2001, Hagiwara et al., 1993, White, 1989). The results in over-realizable cases show that the degree of over-fitting of layered neural networks is not bounded by 1/n in probability (Fukumizu, 2003, Hagiwara, 2002, Hayasaka et al., 2004); the degree of over-fitting has a probabilistic lower bound of the order (log n)/n for a multi-layer perceptron with a wide class of noise models (Fukumizu, 2003). In the case of one-dimensional regression by a network with one step-function hidden unit, the degree of over-fitting to Gaussian noise has a probabilistic lower bound of the order of (log log n)/n (Hayasaka et al., 2004), while it is (log n)/n when two step-function hidden units are used (Fukumizu, 2003, Hagiwara, 2002). Furthermore, in the case of Gaussian radial basis functions, the degree of over-fitting to Gaussian noise has a probabilistic lower bound of the order of (log n)/n, even when the number of Gaussian units is unity (Hagiwara, 2002). All the above results concern situations where a network can realize a function with very high curvature by using extreme weights close to zero or infinity. The purpose of this study is to provide an explicit relation between the sizes of the weights and a probabilistic upper bound on the degree of over-fitting. A restriction is imposed on the shape of the functions in the hidden units in order to specify the sizes of the weights. In particular, the restriction depends on the number of training data and typically enlarges the weight space as the data size increases.
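The mechanism behind such (log n)/n-type bounds can be illustrated numerically. The sketch below (a rough illustration only, not the construction used in the cited proofs) checks that the largest of n squared Gaussian noise values grows like 2 log n; a unit narrow enough to interpolate that single sample therefore lowers the training error by roughly (2 log n)/n rather than the regular 1/n order.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 500

# A hidden unit that can form an arbitrarily narrow hump (e.g. a Gaussian
# unit with a tiny width) can interpolate the single largest noise sample
# while leaving the rest of the fit unchanged.  The resulting drop in the
# training error is max_i eps_i^2 / n, and for standard Gaussian noise the
# extreme value max_i eps_i^2 concentrates around 2 * log(n).
for n in (10**2, 10**3, 10**4):
    m = np.mean([np.max(rng.standard_normal(n) ** 2) for _ in range(trials)])
    print(n, m / (2 * np.log(n)))   # ratio tends to 1 as n grows (slowly)
```

The slow convergence of the ratio mirrors why these effects are invisible at small sample sizes but dominate the regular 1/n term asymptotically.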
Various orders of the probabilistic upper bound, such as 1/n and (log n)/n, are given by different restrictions. Thus, the lower bound obtained in Fukumizu (2003) and Hagiwara (2002) without restricting the weight space does not necessarily hold if the restriction is loosened slowly as the number of training data increases. This paper shows that the (log n)/n upper bound is typical under such a restriction.
This paper also provides theoretical verification of the empirical observation that over-fitting to noisy data is caused by extreme values of the weights of input units, which results in a high curvature of the graph of the network. In an over-realizable case, a network with sigmoidal or Gaussian units can produce a narrow hump by using extreme weight values. This makes it possible for the network to fit a single training point while keeping the rest of the function almost constant. On the other hand, the network can also produce a smoother function with a certain amount of error at every training point. An interesting question is which of these is preferable for achieving the smallest training error. In this paper, a simple example of a network with one Gaussian unit is considered, and the above-mentioned problem is solved by applying the bounds on the degree of over-fitting; it is proved that the input weights of a trained network lie in a small region that shrinks to zero as the sample size goes to infinity. This implies that the former fitting is preferred in the over-realizable case. Such a result on the behavior of the trained weights has not been obtained in the standard asymptotic theory or computational learning theory, which focus on the estimation error rather than the degree of over-fitting.
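The competition between the two fitting strategies can be sketched numerically. The toy example below is an illustration with an assumed single Gaussian unit c * exp(-(x - a)^2 / width^2), not the paper's exact setting: it compares the best training error attainable with a wide (smooth) unit against a very narrow unit centred on a training point, on data that are pure noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.linspace(0.0, 1.0, n)
eps = rng.standard_normal(n)          # noisy targets; the true function is 0

def train_error(width):
    """Best training error of a single Gaussian unit c*exp(-(x-a)^2/width^2),
    with the centre a searched over the training inputs and the output
    weight c fit by least squares."""
    best = np.inf
    for a in x:
        g = np.exp(-((x - a) ** 2) / width ** 2)
        c = g @ eps / (g @ g)         # optimal output weight for this unit
        best = min(best, np.mean((eps - c * g) ** 2))
    return best

wide, narrow = train_error(0.5), train_error(1e-3)
print(wide, narrow)   # in many realizations the narrow hump wins
```

With width 1e-3 the unit is effectively an indicator of its centre, so the narrow fit removes exactly max_i eps_i^2 / n from the mean squared error, which is the hump-fitting behavior the trained-weight analysis predicts.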
The estimation error, which is the other cause of the difference between the generalization error and the training error, has been intensively investigated using computational learning theory or statistical learning theory (Anthony and Bartlett, 1999, Haussler, 1992, Krzyżak et al., 1996, Vapnik, 1998). While this approach has the advantage of being free from the previously mentioned unidentifiability problem, it is not suitable for the detailed analysis of over-fitting and trained weights. In this approach, an upper bound is obtained for the estimation error, and the bound is used to evaluate the accuracy of the trained network in terms of generalization capability and to show the uniform convergence of the generalization error of the trained network to its optimal level (Anthony and Bartlett, 1999, Devroye et al., 1996). The approach is also applied to a model selection strategy called structural risk minimization (Vapnik, 1998). In deriving the bound on the generalization error, the main technique is to bound the worst case by taking the supremum of the difference between the generalization error and the training error over all possible networks (Anthony and Bartlett, 1999, Bartlett, 1998, Vapnik, 1998). Although this simplifies the mathematical problem, since the detailed properties of specific trained weights are not required, it makes over-fitting difficult to analyze.
This paper is organized as follows. Section 2 details the formulation of a neural network regression and the definition of the degree of over-fitting. For this purpose, the re-parametrization of network functions is introduced here. The main results are presented in Section 3. In this section, a probabilistic upper bound for the degree of over-fitting is first derived, which reflects a given restriction in the weight space. The variation in the probabilistic upper bounds, which is induced by the degree of the restriction, is obtained from this bound. The various bounds allow the analysis of the behavior of the trained weights in over-fitting. Section 3 also includes an analysis of the trained weights for a Gaussian unit. The conclusions and future works are presented in Section 4.
Section snippets
Neural network regression
In layered neural networks such as three-layer perceptrons and radial basis function networks, the output f(x; θ) for an input x can generally be written as a linear combination of hidden-unit outputs, f(x; θ) = Σ_k c_k φ(x; w_k), where θ is a network weight vector collecting the output weights c_k and the hidden-unit parameters w_k. The parameter set is specified according to the type of activation function employed in the hidden layer. The weight space of θ is denoted by Θ. …
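Since the inline symbols of this snippet were lost in extraction, the following sketch restates the generic sum-of-hidden-units form with assumed names (c for output weights, w for hidden-unit parameters, phi for the activation function):

```python
import numpy as np

def network_output(x, c, w, phi):
    """Output of a layered network with K hidden units:
       f(x; theta) = sum_k c[k] * phi(x, w[k]),
    where theta = (c, w) collects the output weights c
    and the hidden-unit (input-side) parameters w."""
    return sum(ck * phi(x, wk) for ck, wk in zip(c, w))

# Example: a radial basis function network with Gaussian hidden units,
# each parametrized by w = (centre, width).
gauss = lambda x, w: np.exp(-((x - w[0]) ** 2) / w[1] ** 2)
c = [1.0, -0.5]
w = [(0.0, 1.0), (2.0, 0.5)]
y = network_output(1.0, c, w, gauss)   # = exp(-1) - 0.5 * exp(-4)
```

Swapping `gauss` for a sigmoid of a linear form in x recovers the three-layer perceptron case; only the hidden-unit parametrization changes.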
Main theorem
Restrictions are introduced on the weight space to control the complexity of the function class and to show how the restrictions affect the degree of over-fitting. Consider a family of bounded functions on the input space with values in a fixed interval. In accordance with the parametrization mentioned in Section 2.2, a class of functions of the form c·g is considered, where c is a coefficient and g is a member of the bounded family. Note that this class includes the network functions considered in Section 2.2, in which the function is defined by a linear …
Conclusions and future works
A probabilistic upper bound for the degree of over-fitting to Gaussian noise is obtained for neural network regression, provided that the function class is restricted so that the complexity of the class is controlled. Using the re-parametrization, a network function is represented as a bounded function g multiplied by a coefficient c. The restriction states that the squared sum of the outputs of the bounded function at the given inputs is bounded away from zero by a positive constant that …
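The restriction described above can be sketched as a simple predicate. The names g, xs and c_n, and the unit bound on g, are assumptions filling in the symbols lost from this snippet; the dependence of the constant on the sample size follows the truncated definition in the text.

```python
import numpy as np

def satisfies_restriction(g, xs, c_n):
    """Check the restriction on a bounded function g (values in [-1, 1]):
    the squared sum of its outputs at the training inputs xs must be
    bounded away from zero by the positive constant c_n."""
    vals = np.array([g(x) for x in xs])
    assert np.all(np.abs(vals) <= 1.0), "g must take values in [-1, 1]"
    return float(np.sum(vals ** 2)) >= c_n
```

Loosening the restriction corresponds to letting c_n decrease as the number of training data grows, which is what produces the different orders of the upper bound discussed in Section 3.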
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments and suggestions for the improvement. This research was partly supported by a Grant-in-Aid for Scientific Research 15700187 from the Ministry of Education, Science, Sports and Culture, Japan.
References (22)
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation.
Amari, S., & Ozeki, T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Transactions on Fundamentals.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory.
Devroye, L., et al. (1996). A probabilistic theory of pattern recognition.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks.
Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation.
Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input–output map. Neural Networks.