
Neural Networks

Volume 14, Issue 8, October 2001, Pages 1075-1088

Contributed article
A dynamical model for the analysis and acceleration of learning in feedforward networks

https://doi.org/10.1016/S0893-6080(01)00052-1

Abstract

A dynamical system model is derived for feedforward neural networks with one layer of hidden nodes. The model is valid in the vicinity of flat minima of the cost function that arise due to the formation of clusters of redundant hidden nodes with nearly identical outputs. The derivation is carried out for networks with an arbitrary number of hidden and output nodes and is, therefore, a generalization of previous work valid for networks with only two hidden nodes and one output node. The Jacobian matrix of the system is obtained, whose eigenvalues characterize the evolution of learning. Flat minima correspond to critical points of the phase plane trajectories, and the bifurcation of the eigenvalues signifies their abandonment. Following the derivation of the dynamical model, we show that identification of the clusters of hidden nodes using unsupervised learning techniques enables the application of a constrained optimization algorithm (Dynamically Constrained Back Propagation, DCBP) whose purpose is to facilitate prompt bifurcation of the eigenvalues of the Jacobian matrix and, thus, accelerate learning. DCBP is applied to standard benchmark tasks either autonomously or as an aid to other standard learning algorithms in the vicinity of flat minima. Its application leads to a significant reduction in the number of epochs required for convergence.

Introduction

Multilayer feedforward neural networks have been the preferred neural network architectures for the solution of classification and function approximation problems due to their interesting learning and generalization abilities. Among the numerous methods that have been proposed for training multilayer feedforward networks, some, including classic back-propagation, have relatively low complexity per epoch, but are rather inefficient in dealing with extended plateaus (or flat minima) of the cost function. Other methods are more efficient in dealing with complex topological features of the cost function landscape at the expense of added computational complexity. Notable examples include both off-line and on-line learning paradigms. For example, second order methods related to efficient off-line learning require the evaluation and inversion of the Hessian matrix, which is clearly a computationally very demanding task when the number of parameters is large. The same problem is also evident in efficient on-line techniques such as the natural gradient descent method (Amari, 1998), which requires the inversion of the Fisher information matrix, whose computational cost is very large for large-scale problems.

In earlier work (Ampazis, Perantonis & Taylor, 1999a), we approached the problem of flat minima using a method that originates from the theory of dynamical systems. Motivated by the connection between flat minima and the build-up of redundancy, we introduced suitable state variables formed by appropriate linear combinations of the synaptic weights, and we derived a linear dynamical system model for a network with two hidden nodes and a single output. Using that model, we were able to describe the dynamics of such a network in the vicinity of flat plateaus, and we showed that the learning behavior can be characterized by the largest eigenvalue of the Jacobian matrix corresponding to the linearized system. It was shown that in the vicinity of flat minima, learning evolves slowly because this eigenvalue is very small, and that the network is able to abandon the minimum only when the eigenvalues of its Jacobian matrix bifurcate.

The study of the nature of flat minima, apart from its intrinsic value in advancing research into the dynamics of learning, can have a significant impact on the development of new learning methods inspired by a deeper understanding of the fundamental mechanisms involved in the dynamical behavior of layered networks. We envisage two types of benefits coming from this approach.

  1. Having identified a flat minimum, one can apply a computationally intensive algorithm just for a few epochs until the flat minimum is abandoned, thus reducing the overall computational complexity of the learning process.

  2. The insight gained by the analysis of the nature of the flat minima is valuable for proposing tailor-made efficient algorithms for promptly abandoning temporary minima, whose complexity is much lower than related general purpose algorithms. Thus, even for the few epochs that will be needed to abandon the flat minimum there will be a gain in computational cost.

It is our belief that the last statement is true for both on-line and off-line learning. In our earlier work, we concentrated on the off-line mode of learning in order to propose one such tailor-made algorithm. We derived an analytical expression representing an approximation to the largest eigenvalue and introduced an efficient constrained optimization algorithm that achieves simultaneous minimization of the cost function and maximization of the largest eigenvalue of the Jacobian matrix of the dynamical system model, so that the network avoids getting trapped at a flat minimum. As a result, significant acceleration of learning in the vicinity of flat minima was achieved, reducing the total training time. The algorithm was also benchmarked against back-propagation and other well-known variants thereof in classification problems, exhibiting very good overall behavior.

The purpose of this paper is to extend the dynamical analysis in order to account for a more general type of feedforward network. We still consider networks with one hidden layer, but place no restriction whatsoever on the number of input, hidden and output nodes. Our study shows that the introduction of suitable state variables results in significant decouplings in the essential quantities related to learning, and, for off-line learning, leads to the formulation of a linear dynamical system model for this more general type of network. In particular, for each cluster of redundant hidden nodes, a linearized system in the corresponding dynamical variables is introduced, which is described by a corresponding symmetric Jacobian matrix whose dimension is lower than the total number of weights and thresholds of the network. Abandonment of flat minima arising from the build-up of redundancy is signified by the bifurcation of the eigenvalues of the Jacobian matrix of each cluster of redundant hidden units.
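Although the specific state variables and the explicit form of the per-cluster Jacobians (Eq. (30)) are derived in Section 2 and are not reproduced in this snippet, the generic structure of each per-cluster model described above can be sketched as a linear system in a cluster state vector (a schematic form, not the paper's exact notation):

$$\dot{\mathbf{u}}_c = J_c\,\mathbf{u}_c, \qquad \mathbf{u}_c(t) = \sum_{k} a_k\, e^{\lambda_{c,k} t}\,\mathbf{v}_{c,k},$$

where $(\lambda_{c,k}, \mathbf{v}_{c,k})$ are the eigenpairs of the symmetric matrix $J_c$. Near a flat minimum all $\lambda_{c,k}$ are very small, so $\mathbf{u}_c$ evolves slowly; the minimum is abandoned only after the eigenvalues bifurcate and the largest one becomes appreciably large.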

Moreover, we extend our effort to incorporate the dynamical system formalism into a learning algorithm that allows successful negotiation of the flat minima and, therefore, accelerates learning. It turns out that such a task requires the ability to identify clusters of redundant hidden nodes, which can be achieved using unsupervised clustering techniques. The identification of individual clusters allows the calculation of the Jacobian eigenvalues of the dynamical system model and the application of extended constrained learning optimization techniques that enable prompt bifurcation of the eigenvalues. A training algorithm (Dynamically Constrained Back Propagation, DCBP) ensues, which can be applied either autonomously or as an aid, in the vicinity of flat minima, to other well-known supervised learning algorithms. In the experimental section it is shown that DCBP exhibits improved learning abilities compared to standard back-propagation and to other reputedly fast learning algorithms (resilient propagation, ALECO-2 and variants of the conjugate gradient method) on standard benchmark tasks.

The paper is organized as follows: in Section 2 we introduce the dynamical variables for arbitrary networks with a single hidden layer and discuss the relation of the corresponding dynamical system model, arising in the off-line learning mode, to other on-line techniques dealing with the flat minima problem. In Section 3 we introduce the constrained optimization method designed to facilitate learning using constraints imposed on the eigenvalues of the Jacobian matrix. In Section 4 we present an outline of the steps required by the proposed DCBP algorithm. Section 5 contains our simulation results and describes the experiments conducted to test the performance of the algorithm and compare it with that of other supervised learning algorithms. Finally, in Section 6 conclusions are drawn and future work is outlined.

Section snippets

Motivation

Consider a neural network with a single hidden layer which has N external input signals with the addition of a bias input. The bias signal is identical for all neurons in the network. The hidden layer consists of M neurons and the output layer contains K neurons with sigmoid activation functions f(s) = 1/(1 + exp(−s)). For a given training pattern p, the square error cost function is

$$E_p = \frac{1}{2}\sum_{i=1}^{K}(d_i - y_i)^2,$$

where $y_i$ denote the output activations and $d_i$ are the desired responses of each output node $i$. The
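To make the notation above concrete, the following minimal sketch computes the forward pass and the cost $E_p$ for a network of this form; the function and array names, shapes and the bias-handling convention are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sigmoid(s):
    # f(s) = 1 / (1 + exp(-s)), the activation of the hidden and output nodes
    return 1.0 / (1.0 + np.exp(-s))

def forward_and_cost(x, d, W_hid, W_out, bias=1.0):
    """Forward pass and squared-error cost E_p for one training pattern p.

    x     : input vector of length N (the N external input signals)
    d     : desired responses, length K
    W_hid : (M, N + 1) hidden-layer weights, last column holding the bias weights
    W_out : (K, M + 1) output-layer weights, last column holding the bias weights
    bias  : the common bias signal fed to every neuron in the network
    """
    x_b = np.append(x, bias)              # augment the input with the bias signal
    h = sigmoid(W_hid @ x_b)              # hidden activations, length M
    h_b = np.append(h, bias)              # the same bias signal feeds the output layer
    y = sigmoid(W_out @ h_b)              # output activations y_i, length K
    E_p = 0.5 * np.sum((d - y) ** 2)      # E_p = 1/2 * sum_i (d_i - y_i)^2
    return y, E_p
```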

Constrained optimization method

In this section, we concentrate on the utilization of the information provided by the dynamical system model for off-line learning in order to explore potential ways of helping the network to escape from flat minima. Following the analysis of the previous section, it is evident that if the maximum eigenvalues $\lambda_c$, $c = 1, \ldots, S$, of the Jacobian matrices $J_c$ of Eq. (30) corresponding to each of the $S$ clusters of hidden nodes are relatively large, then the network is able to escape from the flat minimum.
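Since the explicit form of Eq. (30) is not reproduced in this snippet, the sketch below simply assumes each cluster Jacobian $J_c$ is available as a symmetric array and checks whether its largest eigenvalue is still small, i.e. whether the corresponding cluster is still trapped; the threshold value is an illustrative assumption.

```python
import numpy as np

def max_cluster_eigenvalues(cluster_jacobians, small_eig_threshold=1e-3):
    """Largest eigenvalue of each cluster Jacobian J_c, c = 1, ..., S.

    cluster_jacobians   : list of S symmetric arrays, one per cluster of
                          redundant hidden nodes (their form follows Eq. (30),
                          which is not shown in this snippet).
    small_eig_threshold : below this value a cluster is considered still
                          trapped at the flat minimum (illustrative choice).
    """
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    largest = np.array([np.linalg.eigvalsh(J)[-1] for J in cluster_jacobians])
    still_trapped = largest < small_eig_threshold
    return largest, still_trapped
```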

DCBP algorithm outline

In order to formulate a training strategy which takes into account both the dynamical system analysis and the constrained optimization method, we should ensure that we are able to identify the clusters that are formed during the training process. For the cluster identification problem, it should be clear that normally it is difficult to obtain a clear sense of how many clusters are formed during training, but one can only suspect their formation when the error improvement is very small (e.g.
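The paper leaves the choice of unsupervised clustering technique open; as one simple illustration, hidden nodes can be grouped greedily by comparing their output vectors over the training set. The tolerance and the greedy grouping rule below are assumptions made here for illustration, not the paper's prescription.

```python
import numpy as np

def identify_redundant_clusters(hidden_outputs, tol=0.05):
    """Group hidden nodes whose outputs over the training set are nearly identical.

    hidden_outputs : (P, M) array; column j holds the outputs of hidden node j
                     on all P training patterns.
    tol            : mean absolute difference below which two nodes are grouped
                     (an illustrative value).
    Returns the clusters with more than one member, i.e. the groups of
    redundant hidden nodes suspected of causing a flat minimum.
    """
    M = hidden_outputs.shape[1]
    cluster_of = [-1] * M          # cluster index assigned to each hidden node
    clusters = []
    for j in range(M):
        if cluster_of[j] >= 0:
            continue
        members = [j]
        cluster_of[j] = len(clusters)
        for k in range(j + 1, M):
            close = np.mean(np.abs(hidden_outputs[:, j] - hidden_outputs[:, k])) < tol
            if cluster_of[k] < 0 and close:
                members.append(k)
                cluster_of[k] = len(clusters)
        clusters.append(members)
    return [c for c in clusters if len(c) > 1]
```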

Simulation results

In our simulations, we studied the dynamics of feedforward networks that were trained to solve two different parity problems and a real world classification problem from the PROBEN1 database (Prechelt, 1994). In particular we studied the 3-bit and 4-bit parity problems and the cancer classification problem of the PROBEN1 set (the standard PROBEN1 benchmarking rules were applied). We have also tried to highlight the benefits that can arise either solely from our method (which is useful in the
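For reference, the complete training set for an n-bit parity task consists of all binary patterns of length n together with their parity; a minimal sketch (the function name is ours):

```python
import numpy as np
from itertools import product

def parity_dataset(n_bits):
    """All 2^n binary patterns of length n_bits and their parity targets."""
    X = np.array(list(product([0, 1], repeat=n_bits)), dtype=float)
    t = (X.sum(axis=1) % 2).reshape(-1, 1)   # target 1 for an odd number of ones
    return X, t

X3, t3 = parity_dataset(3)   # 3-bit parity: 8 patterns
X4, t4 = parity_dataset(4)   # 4-bit parity: 16 patterns
```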

Conclusions

In this paper, a dynamical system model for feedforward networks has been introduced. The model is useful for analyzing the dynamics of learning in feedforward networks in the vicinity of flat minima arising from redundancy of nodes in the hidden layer. It was shown that, as a direct consequence of the build up of redundancy, it is possible to describe the dynamics of feedforward networks using appropriate state variables whose total number is reduced compared to the total number of free

