Neurocomputing

Volume 63, January 2005, Pages 5-23

Analyzing the weight dynamics of recurrent learning algorithms

https://doi.org/10.1016/j.neucom.2004.04.006

Abstract

We provide insights into the organization and dynamics of recurrent online training algorithms by comparing real time recurrent learning (RTRL) with a new continuous-time online algorithm. The latter is derived in the spirit of a recent approach introduced by Atiya and Parlos (IEEE Trans. Neural Networks 11 (3) (2000) 697), which leads to non-gradient search directions. We refer to this approach as Atiya–Parlos learning (APRL) and interpret it with respect to its strategy to minimize the standard quadratic error. Simulations show that the different approaches of RTRL and APRL lead to qualitatively different weight dynamics. A formal analysis of the one-output behavior of APRL further reveals that the weight dynamics favor a functional partition of the network into a fast output layer and a slower dynamical reservoir, whose rates of weight change are closely coupled.

Introduction

Recurrent neural networks (RNNs) are attractive tools for tasks of a sequential nature like time-series prediction, sequence generation, speech recognition, or adaptive control. Many architectures have been developed, ranging from fully connected to partially or locally recurrent networks. Although they are applied successfully in practice, the dynamical behavior of the networks makes the training algorithms highly complex. The credit assignment problem and the potentially rich dynamical properties of the networks make it difficult to devise efficient recurrent learning schemes. Contemporary approaches frequently employ regularization techniques to tackle these problems [2]. The analysis of the dynamical properties of recurrent networks and of the corresponding learning algorithms nevertheless remains a challenging field of research.

Some encouraging results are available toward formulating a common unifying framework for gradient-based techniques like real time recurrent learning (RTRL) or backpropagation through time (BPTT). Atiya and Parlos [1] have shown that these can be derived from a constrained optimization problem which combines the quadratic error with constraints reflecting the network dynamics. They also introduced a new strategy to minimize the standard error which employs search directions different from the gradient: it considers the states as control variables and computes weight updates to achieve a targeted change in the state variables. We refer to this strategy as Atiya–Parlos learning (APRL). In [1], an O(n²)-efficient APRL algorithm was given for discrete networks, while we will introduce below an APRL algorithm for continuous-time networks. We also show that APRL can be interpreted as a truncated “one-step-backward” propagation of the instantaneous error combined with a momentum term providing the necessary dynamic memory.
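To make the contrast with gradient descent concrete, the following sketch illustrates the core of the Atiya–Parlos strategy for a simple discrete-time network. It is a schematic illustration under our own simplifying assumptions (a single time step, tanh dynamics, names such as `aprl_like_step` are ours), not the algorithm of [1]:

```python
import numpy as np

def aprl_like_step(W, x, u_drive, e_next, eta=0.01):
    """Schematic Atiya-Parlos-style update (illustration only).

    Assumed dynamics: x_next = tanh(W @ x + u_drive), where u_drive is
    the external input drive and e_next is the gradient of the
    quadratic error with respect to x_next.

    Strategy: treat the states as control variables. First pick a
    desired state change dx that reduces the error, then solve for the
    minimum-norm weight change dW that realizes dx through the
    linearized dynamics.
    """
    x_next = np.tanh(W @ x + u_drive)
    dx = -eta * e_next                    # targeted change of the states

    # Linearization: d(x_next) = (1 - x_next**2) * (dW @ x),
    # so we need dW @ x = dx / (1 - x_next**2).
    b = dx / (1.0 - x_next**2)

    # Minimum-norm solution of dW @ x = b is the rank-one matrix
    # b x^T / (x^T x).
    dW = np.outer(b, x) / (x @ x)
    return W + dW
```

The contrast with a gradient method is that no derivative of the error with respect to the weights is formed: the desired state change comes first, and the weight change is the smallest perturbation that realizes it through the network dynamics.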

The existence of such different approaches as RTRL and APRL, which minimize the standard quadratic error function along different search directions, motivates further investigation of whether and how the resulting weight dynamics differ. This is especially interesting for online trajectory learning, because in this case the input data are highly correlated: the gradient direction is then no longer a well-founded heuristic, since it does not implement a stochastic gradient descent any more, and instead the combined dynamics of error function and weights determine the learning success. The obvious problem is that direct access to the dynamics of online learning on the high-dimensional error surface is not possible, for two main reasons: the multiple constraints acting through the network dynamics, and the complex dependency of the time-varying error surface on the network parameters.

Therefore, we use a comparative approach and investigate the Atiya–Parlos methodology and its weight dynamics as opposed to results for RTRL [8] on two typical tasks. Simulations and theoretical results show that the two algorithms behave quite differently and yield evidence that their different ways of minimizing the error lead to unrelated weight dynamics. This point of view is further supported by a formal treatment of the one-output case of APRL, which reveals a functional division of the network that resembles the “echo state network” [4], [5] and “liquid state machine” [6], [7] approaches more than classical backpropagation.

The remainder of this work is organized as follows: In Section 2, we give a continuous-time online algorithm based on the APRL paradigm [11] and interpret the algorithm with respect to its strategy to minimize the error function. In Section 3, we present simulation results for APRL and RTRL and analyze the weight dynamics of the algorithms. In Section 4, we turn to the one-output case and prove that APRL leads to very special weight dynamics that result in a functional partition of the network. In Section 5, we summarize the results and discuss the insights gained.

Section snippets

APRL

For further reference, we give a continuous-time algorithm which is derived with the APRL learning approach in a straightforward way, similar to the treatment in [1] for discrete networks.
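The continuous-time network equations themselves are not reproduced in this snippet. As a point of reference, a common continuous-time model to which such a derivation applies is the leaky-integrator network below; this is a generic sketch with an assumed time constant `tau` and Euler integration, not necessarily the exact model of the paper:

```python
import numpy as np

def simulate_ct_rnn(W, W_in, inputs, dt=0.01, tau=1.0):
    """Euler discretization of a generic continuous-time RNN:
        tau * dx/dt = -x + tanh(W @ x + W_in @ u(t)).
    Returns the state trajectory for the given input sequence."""
    x = np.zeros(W.shape[0])
    trajectory = []
    for u in inputs:
        dxdt = (-x + np.tanh(W @ x + W_in @ u)) / tau
        x = x + dt * dxdt
        trajectory.append(x.copy())
    return np.array(trajectory)
```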

Simulation results: APRL vs. RTRL

From the previous discussion we expect that simulations may show substantial performance differences between APRL and RTRL, in particular for online learning. Below we consider two tasks to illustrate such differences: first, the popular Rössler attractor [9], which is mildly chaotic, has a strange attractor as limit set, and has been shown to be learnable by relatively small recurrent networks; and second, the well-known Mackey–Glass system.
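Both benchmark series can be generated as follows; the parameter values are the standard textbook choices (a = b = 0.2, c = 5.7 for Rössler; β = 0.2, γ = 0.1, n = 10, τ = 17 for Mackey–Glass) and need not coincide with the exact experimental setup of the paper:

```python
import numpy as np

def rossler(T, dt=0.01, a=0.2, b=0.2, c=5.7):
    """Euler integration of the Rossler system:
        dx/dt = -y - z,  dy/dt = x + a*y,  dz/dt = b + z*(x - c)."""
    x, y, z = 1.0, 1.0, 1.0
    out = np.empty((T, 3))
    for k in range(T):
        x, y, z = (x + dt * (-y - z),
                   y + dt * (x + a * y),
                   z + dt * (b + z * (x - c)))
        out[k] = (x, y, z)
    return out

def mackey_glass(T, dt=1.0, tau=17, beta=0.2, gamma=0.1, n=10):
    """Euler integration of the Mackey-Glass delay equation:
        dx/dt = beta * x(t-tau) / (1 + x(t-tau)**n) - gamma * x(t).
    The delay is handled via a history buffer with constant
    initial history x = 1.2."""
    delay = int(tau / dt)
    hist = [1.2] * (delay + 1)
    out = np.empty(T)
    for k in range(T):
        x, x_tau = hist[-1], hist[0]
        x_new = x + dt * (beta * x_tau / (1.0 + x_tau**n) - gamma * x)
        hist.append(x_new)
        hist.pop(0)
        out[k] = x_new
    return out
```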

For the Rössler system we learn the operator which maps

The one-output behavior of APRL

In the following, we consider the case of only one output neuron, say x1 (note that all simulations presented use only one output). It turns out that the weight updates in the non-output part of the weight matrix scale equally and with constant rate in every column. The scaling factors are proportional to the weights in the first column. Formally, this is stated as

Proposition 1

$$\forall k\;\; \forall i>1\;\; \forall j>1\;\; \forall h:\qquad \Delta w_{ih}(k) \;=\; \frac{w_{i1}(0)}{w_{j1}(0)}\,\Delta w_{jh}(k).$$

The proof for (13) is given in Appendix B.
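Proposition 1 lends itself to a direct numerical check on recorded weight updates: within every column of the non-output part of the weight matrix, the update of row i must be the update of a reference row scaled by the ratio of the initial first-column weights. The helper below is hypothetical (the logging of `dW_list` is assumed) and uses 0-based indices, with neuron 0 as the single output:

```python
import numpy as np

def check_proposition_1(W0, dW_list, atol=1e-8):
    """Verify dw[i,h](k) == (W0[i,0] / W0[j,0]) * dw[j,h](k) for all
    recorded updates dW(k), all non-output rows i, j and all columns h.
    W0 is the initial weight matrix; row/column 0 is the output part.
    Row 1 serves as the fixed reference row j (assumes W0[1,0] != 0);
    all other row pairs then follow by transitivity."""
    scale = W0[1:, 0] / W0[1, 0]               # w_i1(0) / w_j1(0) per row i
    for dW in dW_list:
        expected = np.outer(scale, dW[1, :])   # scaled reference row
        if not np.allclose(dW[1:, :], expected, atol=atol):
            return False
    return True
```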

The above result shows that there are two

Summary and discussion

Our investigations on the weight dynamics of the continuous-time online-learning algorithm derived with the recent APRL approach [1], [11], [10] in comparison with RTRL show that the different strategies to minimize the error function lead to significantly different weight and error surface dynamics. The simulations confirm results from [1] which have also been obtained for different tasks and indicate that APRL can be used with higher learning rates such that the minimal training error is

Acknowledgements

We would like to thank three anonymous reviewers for detailed and valuable comments, which helped very much to improve the manuscript.


References (12)

  • O.E. Rössler

    An equation for continuous chaos

    Phys. Lett.

    (1976)
  • A.F. Atiya et al.

New results on recurrent network training: unifying the algorithms and accelerating convergence

    IEEE Trans. Neural Networks

    (2000)
  • B. Hammer et al.

Tutorial: perspectives on learning with RNNs

  • S. Hochreiter

    The vanishing gradient problem during learning recurrent neural nets and problem solutions

    Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems

    (1998)
  • H. Jaeger, The echo state approach to analysing and training recurrent neural networks, Technical Report 148, GMD,...
  • H. Jaeger

    Adaptive nonlinear system identification with echo state networks

There are more references available in the full text version of this article.


Ulf D. Schiller studied computer science and physics at the University of Bielefeld. He received the Diploma degree in computer science in 2003. His Diploma thesis was carried out at the Neuroinformatics Group of the Faculty of Technology at Bielefeld University. Currently, he is a member of the Condensed Matter Theory Group of the Faculty of Physics at Bielefeld University, where he is working on computer simulations of lipid bilayers. His research interests include neural networks, machine learning, nonlinear dynamics, and modelling and simulation of complex dynamical systems.

Dr. Jochen J. Steil (www.jsteil.de) received the diploma in mathematics from the University of Bielefeld, Germany, in 1993. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, interrupted by one year at the St. Petersburg Electrotechnical University, Russia, where he was supported by a German Academic Exchange Service (DAAD) grant. In 1999, he received the PhD degree with a dissertation on “Input-Output Stability of Recurrent Neural Networks”; in 2002 he was appointed tenured senior researcher and lecturer (Akad. Rat). J.J. Steil is a member of the special research unit 360 “Situated Artificial Communicators” and the Graduate Program “Task Oriented Communication” and heads projects on robot learning and intelligent systems. His main research interests are the analysis, stability, and control of recurrent dynamics and recurrent learning, as well as cognitively oriented learning architectures of complex robots for multimodal communication and grasping. He is a member of the ENNS and the IEEE Neural Networks Society.
