Neurocomputing

Volume 57, March 2004, Pages 87-104

Generalization by symbolic abstraction in cascaded recurrent networks

https://doi.org/10.1016/j.neucom.2004.01.006

Abstract

Generalization performance in recurrent neural networks is enhanced by cascading several networks. By discretizing abstractions induced in one network, other networks can operate on a coarse symbolic level with increased performance on sparse and structural prediction tasks. The level of systematicity exhibited by the cascade of recurrent networks is assessed on the basis of three language domains.

Introduction

The combinatorial nature of language has spawned a search for models with an appropriate learning bias operating on the elements of language. Specifically, recurrent neural networks have delivered evidence that models with realistic generalization properties can be adapted from simple language corpora [7], [4], [18]. Single-step prediction is a learning strategy that has proven particularly useful for producing generalizations according to contextual and functional similarities of untagged language data. Given a sequence of words presented to the network, the network essentially learns to output the probabilities of words occurring next, but in a manner quite distinct from conventional n-gram modeling: by concurrently developing abstractions and dynamics for processing linguistic structure in its state space.
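For concreteness, the following minimal numpy sketch illustrates single-step prediction in an Elman-style recurrent network: the hidden state is fed back as context and a softmax output estimates the probability of each vocabulary item occurring next. The toy vocabulary, the network sizes and the random initialization are purely illustrative and do not correspond to the networks used in this paper.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["boy", "girl", "sees", "runs"]              # illustrative vocabulary
    V, H = len(vocab), 8                                 # vocabulary and hidden-layer sizes

    W_xh = rng.normal(scale=0.1, size=(H, V))            # input -> hidden
    W_hh = rng.normal(scale=0.1, size=(H, H))            # context (previous hidden state) -> hidden
    W_hy = rng.normal(scale=0.1, size=(V, H))            # hidden -> output

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def step(word_idx, h_prev):
        # One prediction step: consume a word, return next-word probabilities and the new state.
        x = np.eye(V)[word_idx]                          # one-hot input encoding
        h = np.tanh(W_xh @ x + W_hh @ h_prev)            # state carrying abstraction and dynamics
        p = softmax(W_hy @ h)                            # distribution over the next word
        return p, h

    h = np.zeros(H)
    for w in ["boy", "sees"]:
        p, h = step(vocab.index(w), h)
    print(dict(zip(vocab, p.round(3))))                  # untrained, so roughly uniform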

From a generalization point of view, one problem is that realistic language exposure is sparse. Human language users are found to exhibit an extraordinary capacity to cope with learning situations that are statistically weak. As pointed out by Hadley [10], Phillips [15] and Marcus [12], [13], generalization in humans sometimes goes beyond the regularities inferred with recurrent networks. As an example, if you have heard “Smith fleedled Jones” you would infer that it is grammatical to say “Smith fleedled Belanger”, even though you have never seen “Belanger” as the object in that context [12]. In error-based learning, the likelihood that Belanger would be predicted (in that position) by the network is decreased every time another word occurs instead. If Belanger never occurs in that particular training context, the probability eventually goes down to zero [12] (cf. [19]). It has been argued that neural networks do not exhibit sufficient “systematicity”, that is, an ability to deal with certain types of structural inferences as exemplified above [8], [10].
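To make the argument concrete, the following small numerical sketch (the two-word candidate set is taken from the example above; the learning rate and number of updates are illustrative) shows how error-based learning with a cross-entropy cost drives the predicted probability of a never-occurring continuation towards zero.

    import numpy as np

    words = ["Jones", "Belanger"]
    logits = np.zeros(2)                    # scores for the two candidate objects
    lr = 0.5

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for update in range(2000):
        # In the training data "Jones" always appears as the object, "Belanger" never does.
        p = softmax(logits)
        target = np.array([1.0, 0.0])
        logits -= lr * (p - target)         # gradient of the cross-entropy cost w.r.t. the logits

    print(dict(zip(words, softmax(logits).round(4))))   # Belanger's probability approaches zero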

Inducing the means to process structurally complex languages (e.g. context-free and context-sensitive) in recurrent networks is indeed possible, but it is in many cases difficult and fragile [2], [9], [17]. Available demonstrations of learning structurally complex languages are limited in at least two senses:

  • (1)

    Only a few terminal symbols are employed. It is unclear whether large-scale languages can be used for induction.

  • (2)

    The dynamics, which indicates that the network is actually going beyond regular language processing, is associated with individual terminals rather than word classes. It is unclear whether the network will extrapolate, i.e. utilize the same principles for processing other, related symbols, possibly appearing at new levels of structural embedding.

The availability of grammatical and semantic abstractions, embodied as internal activation clusters (groups) in recurrent networks trained on language corpora, led us to investigate how discretization of such abstractions could improve learning of the dynamics previously detected only for individual words. The idea is supported by the performance improvements observed when continuous features are discretized prior to machine induction [5].

In this paper we show that generalization performance in recurrent neural networks is enhanced by cascading several networks. By discretizing abstractions induced in one network, other networks can operate on a coarse symbolic level with increased performance on sparse and structural prediction tasks. The neural network architecture with associated learning mechanisms is described. The level of systematicity exhibited by the cascade of recurrent networks is assessed on three language domains. Considerable improvement is observed for two of the domains.

Section snippets

Structure and abstraction

Assume a learner faces examples presented sequentially and in random order from the simple context-free language a^l b^l, where 1 ≤ l ≤ 3: aabbabaaabbbaabba…. If we collect statistics for combinations of six consecutive letters (a 6-gram), a prediction task can be carried out to perfection. There would be 25 examples in such a table, so estimating the probabilities used for generating the strings would not take long.
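The tabular predictor can be sketched in a few lines of Python; the number of sampled strings and the random seed below are arbitrary choices for illustration.

    import random
    from collections import Counter, defaultdict

    random.seed(0)
    stream = "".join("a" * l + "b" * l for l in random.choices([1, 2, 3], k=5000))

    six_grams = Counter(stream[i:i + 6] for i in range(len(stream) - 5))
    print(len(six_grams), "distinct 6-grams")             # 25 for this language, given enough samples

    # Turn the table into a next-letter predictor: the first five letters form the context.
    table = defaultdict(Counter)
    for gram, n in six_grams.items():
        table[gram[:5]][gram[5]] += n

    context = "aabba"                                      # e.g. after "aabb" a new string has begun
    total = sum(table[context].values())
    print({c: round(n / total, 2) for c, n in table[context].items()})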

A tabular approach like the 6-gram would not be inclined to process strings generated

Cascaded recurrent networks

The idea this paper presents is simple: the activation clusters found at a hidden layer of a prediction recurrent network can be segmented, discretized and used as inputs and targets of another, cascaded, recurrent network, similarly trained to predict the next input. The basic architecture is shown in Fig. 1.
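The discretization step between the two networks can be sketched as follows, under the assumption that hidden-state vectors recorded from the trained primary network are already available; here they are simulated with synthetic data, and the number of clusters, the state dimensionality and the use of k-means are illustrative choices rather than the paper's exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Placeholder for hidden states recorded while the primary network processed a corpus:
    # three synthetic activation clusters stand in for groups such as "noun"/"verb"/"end".
    centres = np.array([[-1.0] * 10, [0.0] * 10, [1.0] * 10])
    trajectory = rng.integers(0, 3, size=300)                 # pretend state trajectory
    hidden_states = centres[trajectory] + rng.normal(scale=0.1, size=(300, 10))

    k = 3                                                     # number of discrete symbols
    symbols = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(hidden_states)

    one_hot = np.eye(k)[symbols]       # discrete codes: inputs and targets for the cascaded network
    print(symbols[:12])                # the coarse symbol sequence
    print(one_hot[:3])                 # its one-hot encoding

The resulting symbol stream then plays the role that the word stream played for the primary network: the cascaded network is trained to predict the next symbol.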

Effectively, through continuous training, the cascaded network approximates a coarser probability distribution over a different set of discrete variables (e.g. “noun” and “verb” replace

Simulations

Specific network architectures were designed and language data generated to address the question of whether the cascaded recurrent network (through its own learning task and its own independent dynamics) supports a wider range of generalizations, including abstraction and dynamics for structural extrapolation. For benchmarking, conventional recurrent networks (realized by the primary networks) with various numbers of hidden layers and hidden units were studied.

All networks are trained

Discussion

The lowest training cost for a prediction task is attained if the network produces a probability distribution over all items reflecting the frequencies by which each item occurs in the specific context. Marcus has a point when he argues that a non-occurring item will not (in the limit) be predicted by a network that employs error-based learning. However, the network will internally encode similarities between items as they occur over the complete training set and not just as they appear in
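The opening claim can be checked numerically. The sketch below compares the expected cross-entropy cost of three candidate predictions for a single context whose next items occur with assumed, purely illustrative frequencies; the cost is lowest when the prediction equals those frequencies, and that optimum assigns probability zero to the item that never occurs.

    import numpy as np

    freq = np.array([0.7, 0.2, 0.1, 0.0])                 # assumed next-item frequencies in one context

    def expected_cost(pred, eps=1e-12):
        # Expected cross-entropy of a prediction under the frequencies above.
        return -np.sum(freq * np.log(pred + eps))

    candidates = {
        "empirical frequencies": freq,
        "uniform": np.full(4, 0.25),
        "all mass on most frequent": np.array([1.0, 0.0, 0.0, 0.0]),
    }
    for name, pred in candidates.items():
        print(f"{name:26s} cost = {expected_cost(pred):.3f}")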

Conclusion

In language processing there are at least two essential aspects to generalization: abstraction and structural extrapolation. Previous work has focused on one or the other.

We have shown that recurrent networks are well-suited to abstract symbols according to their similarities in a structurally homogeneous prediction task. However, the same networks are unable to reach the same level of performance on structural extrapolation as networks trained directly on abstract data. By feeding discretized

Acknowledgments

The author gratefully acknowledges the insightful comments made by two anonymous reviewers.
