New training strategies for constructive neural networks with application to regression problems
Introduction
Among the numerous existing neural network (NN) paradigms, such as Hopfield networks, Kohonen's self-organizing feature maps (SOMs), etc., the feedforward NNs (FNNs) are the most popular due to their structural flexibility, good representational capabilities, and the large number of available training algorithms (Bose & Liang, 1996; Leondes, 1998; Lippmann, 1987; Sarkar, 1995). In this paper we are mainly concerned with FNNs.
When using a NN, one needs to address three important issues. How they are resolved significantly influences the overall performance of the NN with respect to two considerations: (i) the recognition rate for new patterns, and (ii) the generalization performance on data sets that were not presented during network training.
The first problem is the selection of data/patterns for network training. This problem has practical implications, yet it has not received as much attention from researchers as the other two. The selection of the training data set can have considerable effects on the performance of the trained network. Some research on this issue has been conducted in Tetko (1997) (and the references therein).
The second problem is the selection of an appropriate and efficient training algorithm from the large number of algorithms developed in the literature, such as the classical error backpropagation (BP) (Rumelhart, Hinton, & Williams, 1986), its many variants (Sarkar, 1995; Magoulas et al., 1997; Stager & Agarwal, 1997), and the second-order algorithms (Shepherd, 1997; Osowski et al., 1996), to name a few. Many new training algorithms with faster convergence and lower computational requirements are being developed by researchers in the NN community.
The third problem is the determination of the network size. From a practical point of view this problem is more important than the above two, and it is generally more difficult to solve. The goal is to find a network structure as small as possible that meets certain desired performance specifications. What is usually done in practice is that the developer trains a number of networks of different sizes and then selects the smallest network that fulfills all or most of the performance requirements. This amounts to a tedious process of trial and error that unfortunately seems unavoidable. This paper focuses on developing a systematic procedure for the automatic determination and/or adaptation of the network architecture of a FNN.
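The trial-and-error practice described above can be sketched as follows. This is an illustrative sketch only, not the paper's method: a random-feature model (fixed random hidden weights, least-squares output weights) stands in for a fully trained FNN, and the data, error bound, and candidate sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic training and validation data for an illustrative target function
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal((200, 1))
x_val = rng.uniform(-1.0, 1.0, size=(100, 1))
y_val = np.sin(3.0 * x_val)

def train_and_score(n_hidden):
    # fixed random input-side weights; output weights fit by least squares
    w = rng.standard_normal((1, n_hidden))
    b = rng.standard_normal(n_hidden)
    h = np.tanh(x @ w + b)
    beta, *_ = np.linalg.lstsq(h, y, rcond=None)
    h_val = np.tanh(x_val @ w + b)
    return float(np.mean((h_val @ beta - y_val) ** 2))

# train networks of increasing size; keep the smallest one whose
# validation error meets a prespecified bound
mses = {n: train_and_score(n) for n in (1, 2, 4, 8, 16, 32)}
error_bound = 1e-3
chosen = next((n for n, m in sorted(mses.items()) if m <= error_bound), None)
print("validation MSE per size:", mses)
print("smallest acceptable network:", chosen)
```

Even in this toy form, the sketch shows why the practice is tedious: every candidate size requires a full training run before the smallest acceptable network can be identified.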
The second and third problems are closely related, in the sense that different training algorithms are suitable for different NN topologies. The above three considerations are therefore critical whenever a NN is to be applied to a real-life problem. Consider a data set generated by an underlying function, a situation that commonly arises in pattern classification, function approximation, and regression problems. The problem is to find a model that can represent the input–output relationship of the data set. The model is determined or trained on the data set so that it can predict, within some prespecified error bounds, the output for any new input pattern. In general, a FNN can solve this problem if its structure is chosen appropriately. Too small a network may not be able to learn the inherent complexities present in the data set, whereas too large a network may learn ‘unimportant’ details such as observation noise in the training samples, leading to ‘overfitting’ and hence poor generalization performance. This is analogous to using polynomial functions for curve fitting. Generally, acceptable results cannot be achieved with too few coefficients, since the characteristics or features of the underlying function cannot be captured completely. However, too many coefficients may fit not only the underlying function but also the noise contained in the data, yielding a poor representation of the underlying function. When an ‘optimal’ number of coefficients is used, the fitted polynomial yields the ‘best’ representation of the function and also the best prediction for new data.
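The curve-fitting analogy can be made concrete with a short experiment. The target function, noise level, and polynomial degrees below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# noisy samples of an underlying function (illustrative: one period of a sine)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# held-out points reveal overfitting: a very high degree fits the noise
# in the training samples rather than the underlying function
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2.0 * np.pi * x_test)

errs = {}
for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)
    errs[degree] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree:2d}: test MSE = {errs[degree]:.4f}")
```

Typically the too-simple degree-1 fit misses the shape of the function entirely, a moderate degree captures it well, and a very high degree chases the noise, mirroring the under/over-sized network trade-off described above.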
A similar situation arises in the application of NNs, where it is likewise imperative to match the architecture of the NN to the complexity of the problem. Algorithms that can automatically determine an appropriate network architecture according to the complexity of the underlying function embedded in the data set are clearly cost-efficient, and thus highly desirable. Efforts toward network size determination have been made in the literature for many years, and many techniques have been developed (Hush & Horne, 1993; Kwok & Yeung, 1997a) (and the references therein). Toward this end, in Section 2 we review three general methods that deal with the problem of automatic NN structure determination.
Pruning algorithms
One intuitive way to determine the network size is to first establish, by some means, a network that is considered sufficiently large for the problem at hand, and then trim the unnecessary connections or units to reduce the network to an appropriate size. This is the basis for the pruning algorithms. Since it is much ‘easier’ to select a ‘very large’ network than to find the proper size needed, the pruning idea is expected to provide a practical but partial
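The simplest member of this family, magnitude-based pruning, can be sketched in a few lines. This is a generic illustration of the pruning idea, not the specific algorithm discussed in the paper; the weight matrix and threshold are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-in for the weights of a deliberately oversized trained network
weights = rng.standard_normal((8, 8))

# remove connections whose magnitudes fall below a chosen threshold;
# in practice the network is then retrained to recover accuracy
threshold = 0.5
mask = np.abs(weights) >= threshold
pruned = weights * mask

removed = int((~mask).sum())
print(f"pruned {removed} of {weights.size} connections")
```

More refined pruning criteria (e.g. sensitivity- or saliency-based measures) replace the raw magnitude test, but the overall train-large-then-trim structure is the same.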
Constructive algorithms for feedforward neural networks
In this section, we first give a simple formulation of the training problem for a constructive one-hidden-layer FNN (OHL-FNN) in the context of a nonlinear optimization problem. The advantages and disadvantages of such constructive algorithms are also discussed.
Statement of the problem
Generally, a multivariate model-free regression problem can be described as follows. Suppose one is given $P$ pairs of vectors $\{(\mathbf{x}_p, \mathbf{y}_p)\}_{p=1}^{P}$ that are generated from the unknown model $\mathbf{y}_p = \mathbf{g}(\mathbf{x}_p) + \boldsymbol{\epsilon}_p$, $p = 1, \dots, P$, where the $\mathbf{y}_p$'s are called the multivariate ‘response’ vectors, the $\mathbf{x}_p$'s are called the ‘independent variables’ or the ‘carriers’, and $M$ and $N$ are the dimensions of $\mathbf{y}_p$ and $\mathbf{x}_p$, respectively. The $g_i$'s, $i = 1, \dots, M$, are unknown smooth nonparametric or model-free functions
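The regression setting above can be sketched with synthetic data. The particular function g, the noise level, and the dimensions below are illustrative choices, not the benchmarks used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, M = 500, 2, 1            # number of pairs, input dim, output dim

# carriers x_p, drawn uniformly for illustration
x = rng.uniform(-1.0, 1.0, size=(P, N))

def g(x):
    # stand-in for the unknown smooth model-free function
    return np.sin(np.pi * x[:, :1]) * np.cos(np.pi * x[:, 1:2])

# response vectors y_p = g(x_p) + noise
eps = 0.05 * rng.standard_normal((P, M))
y = g(x) + eps
print(x.shape, y.shape)
```

The regression task is then to recover g (within prespecified error bounds) from the P pairs alone, without assuming any parametric form for it.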
Error scaling strategy for input-side training
In this section, the features of a correlation-based objective function are investigated. Without any loss of generality, a regression problem with only one output is considered. The correlation-based objective function in this case is given as follows (Fahlman & Lebiere, 1991): $J_n = \left| \sum_{p=1}^{P} (e_p - \bar{e})(f_{n,p} - \bar{f}_n) \right|$, with $\bar{e}$ and $\bar{f}_n$ denoting the mean values of the training error and the output of the n-th hidden unit over the entire training
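The cascade-correlation objective of Fahlman and Lebiere can be written directly as code: the score of a candidate hidden unit is the magnitude of the (unnormalized) covariance between the current training error and the unit's output over all patterns. The error and unit-output vectors below are synthetic illustrations:

```python
import numpy as np

def correlation_objective(e, f_n):
    """|sum_p (e_p - mean(e)) * (f_n_p - mean(f_n))| over all P patterns."""
    return float(abs(np.sum((e - e.mean()) * (f_n - f_n.mean()))))

rng = np.random.default_rng(4)
e = rng.standard_normal(100)                # current training errors
good = e + 0.1 * rng.standard_normal(100)   # unit that tracks the error
bad = rng.standard_normal(100)              # unit unrelated to the error
print(correlation_objective(e, good), correlation_objective(e, bad))
```

A candidate whose output covaries strongly with the residual error scores high and is therefore the most useful unit to add, which is exactly why this objective drives input-side training in constructive algorithms.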
Input-side pruning strategies
In input-side training, one can use a single candidate or a pool of candidates to train a new hidden unit. In the latter case, the neuron that yields the maximum objective function is selected as the best candidate. This candidate is incorporated into the network, and its input-side weights are frozen in the subsequent training. However, certain input-side weights may not contribute much to the maximization of the objective function, or indirectly to the reduction of the
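The pool-selection and input-side pruning steps can be sketched together. In this hedged illustration the candidates are merely randomly initialized rather than trained, the residual error is synthetic, and the pruning threshold is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
P, N, pool_size = 200, 5, 8
x = rng.standard_normal((P, N))
e = np.sin(x[:, 0]) - 0.5 * x[:, 1]     # stand-in for the current residual error

def score(w):
    # correlation-based objective for a tanh candidate unit with weights w
    f = np.tanh(x @ w)
    return float(abs(np.sum((e - e.mean()) * (f - f.mean()))))

# select the candidate that maximizes the objective function
candidates = [rng.standard_normal(N) for _ in range(pool_size)]
best = max(candidates, key=score)

# prune input-side weights that contribute little, then freeze the rest
pruned = np.where(np.abs(best) < 0.1, 0.0, best)
kept = int(np.count_nonzero(pruned))
print("kept", kept, "of", N, "input-side weights")
```

In a full implementation each candidate would first be trained to maximize the objective, and the pruning criterion would be tied to each weight's actual contribution rather than a fixed magnitude threshold.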
Convergence of the proposed constructive algorithm
For our proposed constructive OHL-FNN, the convergence of the algorithm with respect to the added hidden units is an important issue that needs careful investigation. First, we investigate an ideal case where, assuming ‘perfect’ input-side training, the convergence of the constructive training algorithm with and without error scaling is determined. The ideal case yields an ‘upper bound’ estimate on the convergence rate that the constructive training algorithm can theoretically
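The qualitative behavior at stake can be illustrated numerically. In this sketch (not the paper's analysis), random input-side weights stand in for trained ones, and the output layer is refit by least squares after each unit is added, so the training error is guaranteed not to increase as the network grows:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, size=(300, 1))
y = np.sin(4.0 * x[:, 0])               # illustrative target function

H = np.empty((300, 0))                  # hidden-layer output matrix
errors = []
for n in range(1, 21):
    # add one tanh hidden unit (random input-side weights for illustration)
    w, b = rng.standard_normal(1), rng.standard_normal()
    H = np.column_stack([H, np.tanh(x[:, 0] * w[0] + b)])
    # refit output weights over all units added so far
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    errors.append(float(np.mean((H @ beta - y) ** 2)))
print("training MSE after 1 and 20 units:", errors[0], errors[-1])
```

Because each new unit enlarges the column space available to the least-squares fit, the training-error sequence is monotonically nonincreasing; how fast it decreases, and how the error-scaling operation affects that rate, is precisely the convergence question examined in this section.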
Conclusions
In this paper, a new constructive adaptive NN scheme is proposed that scales the error signal during the learning process to improve the effectiveness and efficiency of input-side training and to obtain better generalization performance. All the regression simulations performed, with input spaces of up to 13 dimensions, confirmed the effectiveness and superiority of the proposed technique. Further simulations for higher-dimensional input spaces will have to be performed to
Acknowledgements
This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-42515.
References (36)
- A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing (2003).
- Efficient backpropagation training with variable stepsize. Neural Networks (1997).
- Toward generating neural network structures for function approximation. Neural Networks (1994).
- Fast second-order learning algorithm for feedforward multilayer neural networks and its applications. Neural Networks (1996).
- Investigation of the cascor family of learning algorithms. Neural Networks (1997).
- Backpropagation with expected source values. Neural Networks (1991).
- Three methods to speed up the training of feedforward and feedback perceptrons. Neural Networks (1997).
- Efficient partition of learning data sets for neural network training. Neural Networks (1997).
- An adaptive structural neural network with application to EEG automatic seizure detection. Neural Networks (1996).
- Dynamic node creation in backpropagation networks. Connection Science (1989).