Contributed article
A dynamical model for the analysis and acceleration of learning in feedforward networks
Introduction
Multilayer feedforward neural networks have been the preferred neural network architectures for solving classification and function approximation problems, owing to their attractive learning and generalization abilities. Among the numerous methods that have been proposed for training multilayered feedforward networks, some, including classic back-propagation, have relatively low complexity per epoch but are rather inefficient at dealing with extended plateaus (or flat minima) of the cost function. Other methods handle complex topological features of the cost function landscape more efficiently, at the expense of added computational complexity. Notable examples exist in both off-line and on-line learning paradigms. For example, efficient second order methods for off-line learning require the evaluation and inversion of the Hessian matrix, a computationally very demanding task when the number of parameters is large. The same problem is also evident in efficient on-line techniques such as the natural gradient descent method (Amari, 1998), which requires the inversion of the Fisher information matrix, an operation whose cost becomes very large for large-scale problems.
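To make the scaling argument concrete, the following is a minimal numpy sketch of why a natural-gradient update is so much more expensive than a plain gradient step. All quantities here are synthetic stand-ins (the Fisher matrix is built from random score vectors, not from an actual network):

```python
import numpy as np

# A plain gradient step costs O(P) for P parameters; a natural-gradient
# step must solve F dw = g, where F is the P x P Fisher information
# matrix -- an O(P^3) operation with dense linear algebra.
rng = np.random.default_rng(0)

P = 50                                  # number of network parameters (toy size)
g = rng.normal(size=P)                  # ordinary gradient of the cost

# Synthetic positive-definite Fisher matrix F = J^T J / n built from
# hypothetical per-sample score vectors J (assumption: 200 samples).
J = rng.normal(size=(200, P))
F = J.T @ J / 200.0

plain_step = -0.1 * g                             # O(P): scale the gradient
natural_step = -0.1 * np.linalg.solve(F, g)       # O(P^3): solve F dw = g

print(plain_step.shape, natural_step.shape)       # both (50,)
```

Using `np.linalg.solve` rather than explicitly inverting F is the standard numerically stable choice, but the cubic cost in P remains.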
In earlier work (Ampazis, Perantonis & Taylor, 1999a), we approached the problem of flat minima using a method that originates from the theory of dynamical systems. Motivated by the connection between flat minima and the build-up of redundancy, we introduced suitable state variables formed by appropriate linear combinations of the synaptic weights, and we derived a linear dynamical system model for a network with two hidden nodes and a single output. Using that model, we were able to describe the dynamics of such a network in the vicinity of flat plateaus, and we showed that the learning behavior is characterized by the largest eigenvalue of the Jacobian matrix of the linearized system. In the vicinity of flat minima, learning evolves slowly because this eigenvalue is very small, and the network is able to abandon the minimum only when the eigenvalues of its Jacobian matrix bifurcate.
The study of the nature of flat minima, apart from its intrinsic value in advancing research into the dynamics of learning, can have a significant impact on the development of new learning methods inspired by a deeper understanding of the fundamental mechanisms involved in the dynamical behavior of layered networks. We envisage two types of benefits from this approach.
1. Having identified a flat minimum, one can apply a computationally intensive algorithm for just a few epochs until the flat minimum is abandoned, thus reducing the overall computational complexity of the learning process.
2. The insight gained from the analysis of the nature of flat minima is valuable for proposing tailor-made efficient algorithms for promptly abandoning temporary minima, whose complexity is much lower than that of related general purpose algorithms. Thus, even for the few epochs needed to abandon the flat minimum, there is a gain in computational cost.
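The first benefit can be sketched as a hybrid training loop on a toy problem. Everything below is illustrative, not the paper's algorithm: the cost f(w) = w^6 is extremely flat near its minimum, the "cheap" method is plain gradient descent, and the "expensive" method is a Newton step applied only while progress has stalled:

```python
# Toy one-dimensional cost with a very flat region around the minimum.
f = lambda w: w**6
df = lambda w: 6 * w**5          # first derivative
d2f = lambda w: 30 * w**4        # second derivative

w, lr, rel_tol = 1.0, 0.01, 0.05
for epoch in range(200):
    prev = f(w)
    trial = f(w - lr * df(w))            # what a cheap step would achieve
    if prev > 0 and (prev - trial) < rel_tol * prev:
        w -= df(w) / d2f(w)              # expensive (Newton) step on the plateau
    else:
        w -= lr * df(w)                  # cheap gradient step otherwise
print(w, f(w))
```

Plain gradient descent alone would need enormously many epochs to traverse the flat region, whereas the hybrid loop pays the second-order cost only for the epochs where it is actually needed.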
The purpose of this paper is to extend the dynamical analysis to account for a more general type of feedforward network. We still consider networks with one hidden layer, but place no restriction whatsoever on the number of input, hidden and output nodes. Our study shows that the introduction of suitable state variables results in significant decouplings in the essential quantities related to learning and, for off-line learning, leads to the formulation of a linear dynamical system model for this more general type of network. In particular, for each cluster of redundant hidden nodes, a linearized system in the corresponding dynamical variables is introduced, described by a symmetric Jacobian matrix of lower dimension than the total number of weights and thresholds of the network. Abandonment of flat minima arising from the build-up of redundancy is signified by the bifurcation of the eigenvalues of the Jacobian matrix of each cluster of redundant hidden units.
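The eigenvalue criterion can be illustrated with a small sketch. The cluster Jacobian below is a made-up symmetric matrix, and the bifurcation test is a crude illustrative heuristic, not the paper's analysis:

```python
import numpy as np

# Near a flat minimum the eigenvalues of a cluster's symmetric Jacobian J_c
# are small and near-degenerate; escape is signalled when they bifurcate
# (the largest pulls away from the rest).
def cluster_eigenvalues(J_c):
    """Eigenvalues of a symmetric cluster Jacobian, in ascending order."""
    return np.linalg.eigvalsh(J_c)       # eigvalsh exploits symmetry

def has_bifurcated(J_c, gap=1e-3):
    """Crude test: has the largest eigenvalue separated from the second largest?"""
    lam = cluster_eigenvalues(J_c)
    return bool(lam[-1] - lam[-2] > gap)

J_flat = np.array([[1e-6, 1e-7], [1e-7, 1e-6]])   # near-degenerate: still stuck
J_split = np.array([[0.5, 0.0], [0.0, 1e-6]])     # separated: escape under way
print(has_bifurcated(J_flat), has_bifurcated(J_split))   # False True
```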
Moreover, we incorporate the dynamical system formalism into a learning algorithm that allows successful negotiation of flat minima and therefore accelerates learning. This task requires the ability to identify clusters of redundant hidden nodes, which can be achieved using unsupervised clustering techniques. The identification of individual clusters allows the calculation of the Jacobian eigenvalues of the dynamical system model and the application of extended constrained learning optimization techniques that enable prompt bifurcation of the eigenvalues. A training algorithm (Dynamically Constrained Back Propagation, DCBP) ensues, which can be applied either autonomously or, in the vicinity of flat minima, as an aid to other well-known supervised learning algorithms. In the experimental section, it is shown that DCBP exhibits improved learning abilities compared to standard back-propagation and to other reputedly fast learning algorithms (resilient propagation, ALECO-2 and variations of the conjugate gradient method) on standard benchmark tasks.
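One simple way such redundant-node clusters might be identified is by grouping hidden units whose incoming weight vectors point in nearly the same direction. The greedy cosine-similarity grouping below is an illustrative stand-in for the unsupervised clustering step, not the paper's exact procedure, and the threshold is arbitrary:

```python
import numpy as np

def group_redundant_units(W, sim_thresh=0.99):
    """Greedily group rows of W (hidden-unit input weight vectors) whose
    cosine similarity with a cluster representative exceeds sim_thresh.
    Returns a list of index lists, one per cluster."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    U = W / np.maximum(norms, 1e-12)     # unit-normalized weight vectors
    clusters = []
    for i in range(W.shape[0]):
        for c in clusters:
            if U[i] @ U[c[0]] > sim_thresh:   # close to cluster representative
                c.append(i)
                break
        else:
            clusters.append([i])         # start a new cluster
    return clusters

# Units 0 and 2 have (almost) proportional weight vectors -> one redundant cluster.
W = np.array([[1.0, 2.0], [-3.0, 0.5], [1.01, 2.02]])
print(group_redundant_units(W))   # [[0, 2], [1]]
```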
The paper is organized as follows. In Section 2 we introduce the dynamical variables for arbitrary networks with a single hidden layer and discuss the relation of the corresponding dynamical system model, arising in the off-line learning mode, to other on-line techniques dealing with the flat minima problem. In Section 3 we introduce the constrained optimization method designed to facilitate learning using constraints imposed on the eigenvalues of the Jacobian matrix. In Section 4 we present an outline of the steps required by the proposed DCBP algorithm. Section 5 contains our simulation results and describes the experiments conducted to test the performance of the algorithm and to compare it with that of other supervised learning algorithms. Finally, in Section 6, conclusions are drawn and future work is outlined.
Motivation
Consider a neural network with a single hidden layer, which has N external input signals plus a bias input. The bias signal is identical for all neurons in the network. The hidden layer consists of M neurons and the output layer contains K neurons with sigmoid activation functions f(s) = 1/(1 + exp(−s)). For a given training pattern p, the squared-error cost function is E_p = (1/2) Σ_{i=1}^{K} (d_i − y_i)^2, where y_i denotes the output activation and d_i the desired response of each output node i. The
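The network and cost just described can be written out in a few lines of numpy. Weight shapes and names below are illustrative (bias is handled by appending a constant 1 to each layer's input):

```python
import numpy as np

# N inputs plus a bias, M sigmoid hidden units, K sigmoid outputs, and the
# per-pattern squared-error cost E_p = 1/2 * sum_i (d_i - y_i)^2.
def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def pattern_error(x, d, W_hid, W_out):
    """x: (N,) input; d: (K,) target; weight matrices include a bias column."""
    x_b = np.append(x, 1.0)              # append the bias input
    h = sigmoid(W_hid @ x_b)             # (M,) hidden activations
    h_b = np.append(h, 1.0)              # bias for the output layer
    y = sigmoid(W_out @ h_b)             # (K,) output activations
    return 0.5 * np.sum((d - y) ** 2)

rng = np.random.default_rng(1)
N, M, K = 3, 2, 1
W_hid = rng.normal(size=(M, N + 1))
W_out = rng.normal(size=(K, M + 1))
e = pattern_error(np.array([0.0, 1.0, 1.0]), np.array([1.0]), W_hid, W_out)
print(e)
```

Since each sigmoid output lies in (0, 1) and each target here is 0 or 1, the per-pattern cost is bounded by K/2.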
Constrained optimization method
In this section, we concentrate on utilizing the information provided by the dynamical system model for off-line learning in order to explore potential ways of helping the network escape from flat minima. Following the analysis of the previous section, it is evident that if the maximum eigenvalues λ_c, c = 1, …, S of the Jacobian matrices J_c of Eq. (30), corresponding to each of the S clusters of hidden nodes, are relatively large, then the network is able to escape from the flat minimum.
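A key ingredient any such eigenvalue-constrained method needs is the sensitivity of the largest eigenvalue to parameter changes. For a symmetric matrix with a simple largest eigenvalue and unit eigenvector v, the derivative of λ_max with respect to the matrix entries is the outer product v v^T. The sketch below uses that fact directly on a made-up 2x2 Jacobian; it is an illustration of the sensitivity formula, not the paper's ALECO-style constrained update:

```python
import numpy as np

def grow_lambda_max(A, eta=0.1):
    """One steepest-ascent step for the largest eigenvalue of symmetric A:
    d(lambda_max)/dA = v v^T, so dA = eta * v v^T grows lambda_max fastest
    for a given step size."""
    lam, V = np.linalg.eigh(A)           # symmetric eigendecomposition
    v = V[:, -1]                         # unit eigenvector of lambda_max
    return A + eta * np.outer(v, v)      # sign of v is irrelevant here

# Near-degenerate, tiny eigenvalues (a "flat minimum" Jacobian).
A = np.array([[1e-4, 0.0], [0.0, 2e-4]])
before = np.linalg.eigvalsh(A)[-1]
after = np.linalg.eigvalsh(grow_lambda_max(A))[-1]
print(before, after)                     # lambda_max grows by ~eta
```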
DCBP algorithm outline
In order to formulate a training strategy that takes into account both the dynamical system analysis and the constrained optimization method, we must ensure that we are able to identify the clusters that are formed during the training process. Regarding the cluster identification problem, it is normally difficult to know how many clusters are formed during training; one can only suspect their formation when the error improvement is very small (e.g.
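A detector of the kind the outline alludes to can be sketched as follows: flag a suspected flat minimum when the relative error improvement stays below a threshold over a window of epochs. The window and threshold values are illustrative, not the paper's settings:

```python
def on_plateau(errors, window=10, rel_tol=1e-4):
    """errors: per-epoch cost history, most recent value last.
    Returns True when the relative improvement over the last `window`
    epochs falls below rel_tol."""
    if len(errors) < window + 1:
        return False                     # not enough history yet
    old, new = errors[-window - 1], errors[-1]
    return old > 0 and (old - new) / old < rel_tol

stalled = [0.5] * 15                               # essentially no improvement
improving = [0.5 * (0.9 ** k) for k in range(15)]  # steady geometric decrease
print(on_plateau(stalled), on_plateau(improving))  # True False
```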
Simulation results
In our simulations, we studied the dynamics of feedforward networks trained to solve two different parity problems and a real-world classification problem from the PROBEN1 database (Prechelt, 1994). In particular, we studied the 3-bit and 4-bit parity problems and the cancer classification problem of the PROBEN1 set (the standard PROBEN1 benchmarking rules were applied). We have also tried to highlight the benefits that can arise either solely from our method (which is useful in the
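The parity benchmark is easy to reproduce: the N-bit parity problem presents all 2^N binary input patterns, with target 1 exactly when the pattern contains an odd number of ones. A minimal generator:

```python
import numpy as np

def parity_dataset(n_bits):
    """All 2^n binary patterns (one per row) with their parity targets."""
    X = np.array([[(i >> b) & 1 for b in range(n_bits)]
                  for i in range(2 ** n_bits)], dtype=float)
    d = (X.sum(axis=1) % 2).astype(int)  # 1 iff an odd number of 1-bits
    return X, d

X3, d3 = parity_dataset(3)
print(X3.shape, d3.tolist())   # (8, 3) [0, 1, 1, 0, 1, 0, 0, 1]
```

Parity is a classic hard benchmark for gradient-based training because no single input bit carries any information about the target on its own.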
Conclusions
In this paper, a dynamical system model for feedforward networks has been introduced. The model is useful for analyzing the dynamics of learning in feedforward networks in the vicinity of flat minima arising from redundancy of nodes in the hidden layer. It was shown that, as a direct consequence of the build up of redundancy, it is possible to describe the dynamics of feedforward networks using appropriate state variables whose total number is reduced compared to the total number of free
References (20)
- et al. (1999). Dynamics of multilayer networks in the vicinity of temporary minima. Neural Networks.
- et al. (2000). Local minima and plateaus in hierarchical structures of multi-layer perceptrons. Neural Networks.
- (1998). XOR has no local minima: a case study in neural network error surface analysis. Neural Networks.
- et al. (1995). An efficient constrained learning algorithm with momentum acceleration. Neural Networks.
- (1985). Differential-geometrical method in statistics. Springer Lecture Notes in Statistics.
- (1998). Natural gradient works efficiently in learning. Neural Computation.
- et al. (2000). Methods of information geometry.
- et al. (1999). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation.
- Ampazis, N., Perantonis, S. J., & Taylor, J. G. (1999). Acceleration of learning in feed-forward networks using...
- (1981). Pattern recognition with fuzzy objective function algorithms.