Neurocomputing

Volume 63, January 2005, Pages 5-23

Analyzing the weight dynamics of recurrent learning algorithms

https://doi.org/10.1016/j.neucom.2004.04.006

Abstract

We provide insights into the organization and dynamics of recurrent online training algorithms by comparing real time recurrent learning (RTRL) with a new continuous-time online algorithm. The latter is derived in the spirit of a recent approach introduced by Atiya and Parlos (IEEE Trans. Neural Networks 11 (3) (2000) 697), which leads to non-gradient search directions. We refer to this approach as Atiya–Parlos learning (APRL) and interpret it with respect to its strategy to minimize the standard quadratic error. Simulations show that the different approaches of RTRL and APRL lead to qualitatively different weight dynamics. A formal analysis of the one-output behavior of APRL further reveals that the weight dynamics favor a functional partition of the network into a fast output layer and a slower dynamical reservoir, whose rates of weight change are closely coupled.

Introduction

Recurrent neural networks (RNNs) are attractive tools for tasks of a sequential nature like time-series prediction, sequence generation, speech recognition, or adaptive control. Many architectures have been developed, ranging from fully connected to partially or locally recurrent networks. Although they are applied successfully in practice, the dynamical behavior of the networks makes the training algorithms highly complex. The credit assignment problem and the potentially rich dynamical properties of the networks make it difficult to devise efficient recurrent learning schemes. Contemporary approaches frequently employ regularization techniques to tackle these problems [2]. The analysis of the dynamical properties of recurrent networks and of the corresponding learning algorithms nevertheless remains a challenging field of research.

Some encouraging results are available toward formulating a common unifying framework for gradient-based techniques like real time recurrent learning (RTRL) or backpropagation through time (BPTT). Atiya and Parlos [1] have shown that these can be derived from a constrained optimization problem which combines the quadratic error with constraints reflecting the network dynamics. They also introduced a new strategy to minimize the standard error which employs search directions different from the gradient: it considers the states as control variables and computes weight updates to achieve a targeted change in the state variables. We refer to this strategy as Atiya–Parlos learning (APRL). In [1], an O(n²)-efficient APRL algorithm was given for discrete networks, while we will introduce below an APRL algorithm for continuous-time networks. We also show that APRL can be interpreted as a truncated “one-step-backward” propagation of the instantaneous error combined with a momentum term providing the necessary dynamic memory.
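To make the contrast with gradient descent concrete, the following sketch illustrates the core of the Atiya–Parlos strategy for a simple discrete-time network. It is a schematic illustration under our own simplifying assumptions (a single time step, tanh dynamics, names such as `aprl_like_step` are ours), not the algorithm of [1]:

```python
import numpy as np

def aprl_like_step(W, x, u_drive, e_next, eta=0.01):
    """Schematic Atiya-Parlos-style update (illustration only).

    Assumed dynamics: x_next = tanh(W @ x + u_drive), where u_drive is
    the external input drive and e_next is the gradient of the
    quadratic error with respect to x_next.

    Strategy: treat the states as control variables. First pick a
    desired state change dx that reduces the error, then solve for the
    minimum-norm weight change dW that realizes dx through the
    linearized dynamics.
    """
    x_next = np.tanh(W @ x + u_drive)
    dx = -eta * e_next                    # targeted change of the states

    # Linearization: d(x_next) = (1 - x_next**2) * (dW @ x),
    # so we need dW @ x = dx / (1 - x_next**2).
    b = dx / (1.0 - x_next**2)

    # Minimum-norm solution of dW @ x = b is the rank-one matrix
    # b x^T / (x^T x).
    dW = np.outer(b, x) / (x @ x)
    return W + dW
```

The contrast with a gradient method is that no derivative of the error with respect to the weights is formed: the desired state change comes first, and the weight change is the smallest perturbation that realizes it through the network dynamics.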

The existence of such different approaches as RTRL and APRL, which minimize the standard quadratic error function along different search directions, motivates further investigation of whether and how the resulting weight dynamics differ. This is especially interesting for online trajectory learning, because in this case the input data are highly correlated: the gradient direction is then no longer a well-founded heuristic, since it does not implement a stochastic gradient descent any more, and instead the combined dynamics of error function and weights determine the learning success. The obvious problem is that direct access to the dynamics of online learning on the high-dimensional error surface is not possible, for two main reasons: the multiple constraints acting through the network dynamics, and the complex dependency of the time-varying error surface on the network parameters.

Therefore, we use a comparative approach and investigate the Atiya–Parlos methodology and its weight dynamics as opposed to results for RTRL [8] on two typical tasks. Simulations and theoretical results show that the two algorithms behave quite differently and yield evidence that their different ways of minimizing the error lead to unrelated weight dynamics. This point of view is further supported by a formal treatment of the one-output case of APRL, which reveals a functional division of the network that resembles the “echo state network” [4], [5] and “liquid state machine” [6], [7] approaches more than classical backpropagation.

The remainder of this work is organized as follows: In Section 2, we give a continuous-time online algorithm based on the APRL paradigm [11] and interpret the algorithm with respect to its strategy to minimize the error function. In Section 3, we present simulation results for APRL and RTRL and analyze the weight dynamics of the algorithms. In Section 4, we turn to the one-output case and prove that APRL leads to very special weight dynamics that result in a functional partition of the network. In Section 5, we summarize the results and discuss the insights gained.

Section snippets

APRL

For further reference, we give a continuous-time algorithm which is derived with the APRL learning approach in a straightforward way, similar to the treatment in [1] for discrete networks.
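The continuous-time network equations themselves are not reproduced in this snippet. As a point of reference, a common continuous-time model to which such a derivation applies is the leaky-integrator network below; this is a generic sketch with an assumed time constant `tau` and Euler integration, not necessarily the exact model of the paper:

```python
import numpy as np

def simulate_ct_rnn(W, W_in, inputs, dt=0.01, tau=1.0):
    """Euler discretization of a generic continuous-time RNN:
        tau * dx/dt = -x + tanh(W @ x + W_in @ u(t)).
    Returns the state trajectory for the given input sequence."""
    x = np.zeros(W.shape[0])
    trajectory = []
    for u in inputs:
        dxdt = (-x + np.tanh(W @ x + W_in @ u)) / tau
        x = x + dt * dxdt
        trajectory.append(x.copy())
    return np.array(trajectory)
```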

Simulation results: APRL vs. RTRL

From the previous discussion we expect that simulations may show substantial performance differences between APRL and RTRL, in particular for online learning. Below we consider two tasks to illustrate such differences: first, the popular Rössler attractor [9], which is mildly chaotic, has a strange attractor as limit set, and has been shown to be learnable by relatively small recurrent networks; and second, the well-known Mackey–Glass system.
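Both benchmark series can be generated as follows; the parameter values are the standard textbook choices (a = b = 0.2, c = 5.7 for Rössler; β = 0.2, γ = 0.1, n = 10, τ = 17 for Mackey–Glass) and need not coincide with the exact experimental setup of the paper:

```python
import numpy as np

def rossler(T, dt=0.01, a=0.2, b=0.2, c=5.7):
    """Euler integration of the Rossler system:
        dx/dt = -y - z,  dy/dt = x + a*y,  dz/dt = b + z*(x - c)."""
    x, y, z = 1.0, 1.0, 1.0
    out = np.empty((T, 3))
    for k in range(T):
        x, y, z = (x + dt * (-y - z),
                   y + dt * (x + a * y),
                   z + dt * (b + z * (x - c)))
        out[k] = (x, y, z)
    return out

def mackey_glass(T, dt=1.0, tau=17, beta=0.2, gamma=0.1, n=10):
    """Euler integration of the Mackey-Glass delay equation:
        dx/dt = beta * x(t-tau) / (1 + x(t-tau)**n) - gamma * x(t).
    The delay is handled via a history buffer with constant
    initial history x = 1.2."""
    delay = int(tau / dt)
    hist = [1.2] * (delay + 1)
    out = np.empty(T)
    for k in range(T):
        x, x_tau = hist[-1], hist[0]
        x_new = x + dt * (beta * x_tau / (1.0 + x_tau**n) - gamma * x)
        hist.append(x_new)
        hist.pop(0)
        out[k] = x_new
    return out
```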

For the Rössler system we learn the operator which maps

The one-output behavior of APRL

In the following, we consider the case of only one output neuron, say x1 (note that all simulations presented use only one output). It turns out that the weight updates in the non-output part of the weight matrix scale equally and with constant rate in every column. The scaling factors are proportional to the weights in the first column. Formally, this is stated as

Proposition 1

$$\forall k\;\; \forall i>1\;\; \forall j>1\;\; \forall h:\qquad \Delta w_{ih}(k) \;=\; \frac{w_{i1}(0)}{w_{j1}(0)}\,\Delta w_{jh}(k).$$

The proof for (13) is given in Appendix B.
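Proposition 1 lends itself to a direct numerical check on recorded weight updates: within every column of the non-output part of the weight matrix, the update of row i must be the update of a reference row scaled by the ratio of the initial first-column weights. The helper below is hypothetical (the logging of `dW_list` is assumed) and uses 0-based indices, with neuron 0 as the single output:

```python
import numpy as np

def check_proposition_1(W0, dW_list, atol=1e-8):
    """Verify dw[i,h](k) == (W0[i,0] / W0[j,0]) * dw[j,h](k) for all
    recorded updates dW(k), all non-output rows i, j and all columns h.
    W0 is the initial weight matrix; row/column 0 is the output part.
    Row 1 serves as the fixed reference row j (assumes W0[1,0] != 0);
    all other row pairs then follow by transitivity."""
    scale = W0[1:, 0] / W0[1, 0]               # w_i1(0) / w_j1(0) per row i
    for dW in dW_list:
        expected = np.outer(scale, dW[1, :])   # scaled reference row
        if not np.allclose(dW[1:, :], expected, atol=atol):
            return False
    return True
```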

The above result shows that there are two

Summary and discussion

Our investigations on the weight dynamics of the continuous-time online-learning algorithm derived with the recent APRL approach [1], [11], [10] in comparison with RTRL show that the different strategies to minimize the error function lead to significantly different weight and error surface dynamics. The simulations confirm results from [1] which have also been obtained for different tasks and indicate that APRL can be used with higher learning rates such that the minimal training error is

Acknowledgements

We would like to thank three anonymous reviewers for detailed and valuable comments, which helped very much to improve the manuscript.


References (12)

  • O.E. Rössler

    An equation for continuous chaos

    Phys. Lett.

    (1976)
  • A.F. Atiya et al.

New results on recurrent network training: unifying the algorithms and accelerating convergence

    IEEE Trans. Neural Networks

    (2000)
  • B. Hammer et al.

Tutorial: perspectives on learning with RNNs

  • S. Hochreiter

    The vanishing gradient problem during learning recurrent neural nets and problem solutions

    Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems

    (1998)
  • H. Jaeger, The echo state approach to analysing and training recurrent neural networks, Technical Report 148, GMD,...
  • H. Jaeger

    Adaptive nonlinear system identification with echo state networks

There are more references available in the full text version of this article.


Ulf D. Schiller studied computer science and physics at the University of Bielefeld. He received the Diploma degree in computer science in 2003. His Diploma thesis was carried out at the Neuroinformatics Group of the Faculty of Technology at Bielefeld University. Currently, he is a member of the Condensed Matter Theory Group of the Faculty of Physics at Bielefeld University, where he is working on computer simulations of lipid bilayers. His research interests include neural networks, machine learning, nonlinear dynamics, and modelling and simulation of complex dynamical systems.

Dr. Jochen J. Steil (www.jsteil.de) received the diploma in mathematics from the University of Bielefeld, Germany, in 1993. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, interrupted by one year at the St. Petersburg Electrotechnical University, Russia, where he was supported by a German Academic Exchange Service (DAAD) grant. In 1999, he received the PhD degree with a dissertation on “Input-Output Stability of Recurrent Neural Networks”; in 2002 he was appointed tenured senior researcher and lecturer (Akad. Rat). J.J. Steil is a member of the special research unit 360 “Situated Artificial Communicators” and the Graduate Program “Task Oriented Communication” and heads projects on robot learning and intelligent systems. His main research interests are the analysis, stability, and control of recurrent dynamics and recurrent learning, as well as cognitively oriented learning architectures of complex robots for multimodal communication and grasping. He is a member of the ENNS and the IEEE Neural Networks Society.
