Generalization by symbolic abstraction in cascaded recurrent networks
Introduction
The combinatorial nature of language has spawned a search for models with an appropriate learning bias operating on the elements of language. Specifically, recurrent neural networks have delivered evidence that models with realistic generalization properties can be induced from simple language corpora [7], [4], [18]. Single-step prediction learning is a learning strategy that has proven particularly useful for producing generalizations according to contextual and functional similarities in untagged language data. Given a sequence of words presented to the network, the network essentially learns to output the probabilities of words occurring next, but in a manner quite distinct from conventional n-gram modeling: by concurrently developing abstractions and dynamics for processing linguistic structure in its state space.
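Single-step prediction learning can be illustrated with a minimal simple-recurrent-network sketch. This is not the paper's implementation: the layer sizes, the toy alternating corpus, and the one-step-truncated gradient are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: a deterministic alternation "a b a b ..." (an assumption;
# the paper uses richer language data).
vocab = ["a", "b"]
V, H = len(vocab), 8
seq = [0, 1] * 50  # symbol indices

# Elman-style recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1}),
# prediction: p = softmax(W_hy h_t).
W_xh = rng.normal(0, 0.5, (H, V))
W_hh = rng.normal(0, 0.5, (H, H))
W_hy = rng.normal(0, 0.5, (V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for epoch in range(50):
    h = np.zeros(H)
    for t in range(len(seq) - 1):
        x = np.zeros(V); x[seq[t]] = 1.0
        h_prev = h
        h = np.tanh(W_xh @ x + W_hh @ h_prev)
        p = softmax(W_hy @ h)
        # Cross-entropy gradient for single-step prediction of seq[t+1],
        # truncated to one step of backpropagation through time.
        dy = p.copy(); dy[seq[t + 1]] -= 1.0
        dh = (W_hy.T @ dy) * (1.0 - h ** 2)
        W_hy -= lr * np.outer(dy, h)
        W_xh -= lr * np.outer(dh, x)
        W_hh -= lr * np.outer(dh, h_prev)

# After training, presenting "a" from the initial state yields a
# distribution over the next symbol concentrated on "b".
h = np.zeros(H)
x = np.zeros(V); x[0] = 1.0
h = np.tanh(W_xh @ x + W_hh @ h)
p = softmax(W_hy @ h)
```

The point of the sketch is only that the network's output is a probability distribution over next symbols, shaped by error-based learning rather than by explicit counting.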
From a generalization point of view, one problem is that realistic language exposure is sparse. Human language users exhibit an extraordinary capacity to cope with learning situations that are statistically weak. As pointed out by Hadley [10], Phillips [15] and Marcus [12], [13], generalization in humans sometimes goes beyond the regularities inferred by recurrent networks. For example, having heard "Smith fleedled Jones", you would infer that it is grammatical to say "Smith fleedled Belanger" even though you have never seen "Belanger" as the object in that context [12]. In error-based learning, the likelihood that Belanger would be predicted (in that position) by the network is decreased every time another word occurs instead. If Belanger never occurs in that particular training context, the probability eventually goes down to zero [12] (cf. [19]). It has been argued that neural networks do not exhibit sufficient "systematicity": an ability to deal with certain types of structural inference, as exemplified above [8], [10].
Inducing, in recurrent networks, the means for processing structurally complex languages (e.g. context-free and context-sensitive) is indeed possible, but in many cases difficult and fragile [2], [9], [17]. Available demonstrations of learning structurally complex languages are limited in at least two senses:
- (1) Only a few terminal symbols are employed. It is unclear whether large-scale languages can be used for induction.
- (2) The dynamics which indicates that the network is actually going beyond regular language processing is associated with individual terminals rather than word classes. It is unclear whether the network will extrapolate, i.e., utilize the same principles for processing other related symbols, possibly appearing at new levels of structural embedding.
The availability of grammatical and semantic abstractions, embodied as internal activation clusters (groups) in recurrent networks trained on language corpora, led us to investigate how discretization of such abstractions could improve learning of the dynamics previously only detected for individual words. The idea is supported by observed performance improvements from discretizing continuous features prior to machine induction [5].
In this paper we show that generalization performance in recurrent neural networks is enhanced by cascading several networks. By discretizing abstractions induced in one network, other networks can operate on a coarse symbolic level with increased performance on sparse and structural prediction tasks. The neural network architecture with associated learning mechanisms is described. The level of systematicity exhibited by the cascade of recurrent networks is assessed on three language domains. Considerable improvement is observed for two of the domains.
Structure and abstraction
Assume a learner faces examples, presented sequentially and in random order, from the simple context-free language a^n b^n (strings of n a's followed by n b's, for bounded n). If we collect statistics for combinations of six consecutive letters (a 6-gram), the prediction task can be carried out to perfection. There would be 25 examples in such a table, so estimating the probabilities used for generating the strings would not take long.
A tabular approach like the 6-gram would not be inclined to process strings generated
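The tabular baseline can be made concrete with a short sketch. The exact corpus is not given in this excerpt, so a bounded a^n b^n language (n ≤ 3) with a '#' string delimiter is assumed here for illustration:

```python
from collections import defaultdict

# Assumed toy corpus: strings a^n b^n for n <= 3, '#'-delimited,
# concatenated and repeated.
corpus = "".join("a" * n + "b" * n + "#" for n in range(1, 4)) * 10

# "6-gram" table: next-symbol counts for each window of 5 preceding symbols.
ctx_len = 5
table = defaultdict(lambda: defaultdict(int))
for i in range(ctx_len, len(corpus)):
    table[corpus[i - ctx_len:i]][corpus[i]] += 1

# Within the training language the table predicts well: for instance, the
# context "aaabb" is always followed by "b" in this corpus. But the table
# cannot extrapolate: the context "aaaab", which arises in the deeper
# string a^4 b^4, was never seen, so the table is simply silent about it.
```

This is the sense in which a pure lookup table, unlike a network with appropriate dynamics, has no basis for generalizing to novel levels of embedding.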
Cascaded recurrent networks
The idea this paper presents is simple: the activation clusters found at a hidden layer of a recurrent prediction network can be segmented, discretized, and used as inputs and targets of another, cascaded, recurrent network, similarly trained to predict the next input. The basic architecture is shown in Fig. 1.
Effectively, through continuous training, the cascaded network approximates a coarser probability distribution over a different set of discrete variables (e.g. “noun” and “verb” replace
Simulations
Specific network architectures were designed, and language data generated, to address the question of whether the cascaded recurrent network (through its own learning task and its own independent dynamics) supports a wider range of generalizations, including abstraction and dynamics for structural extrapolation. For benchmarking, conventional recurrent networks (realized by the primary networks) with various numbers of hidden layers and hidden units were studied.
All networks are trained
Discussion
The lowest training cost for a prediction task is attained if the network produces a probability distribution over all items reflecting the frequencies with which each item occurs in the specific context. Marcus has a point when he argues that a non-occurring item will not (in the limit) be predicted by a network that employs error-based learning. However, the network will internally encode similarities between items as they occur over the complete training set and not just as they appear in
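A toy calculation makes the first claim concrete (the counts are invented for illustration): minimizing cross-entropy by gradient descent drives the predicted distribution toward the empirical frequencies, so a never-occurring item's probability tends toward zero in the limit, yet never reaches it exactly under a softmax output:

```python
import numpy as np

# Invented statistics for one fixed context: "cat" continues it 3 times,
# "dog" once, "Belanger" never.
counts = np.array([3.0, 1.0, 0.0])
target = counts / counts.sum()          # empirical frequencies (0.75, 0.25, 0.0)

# Gradient descent on softmax logits under cross-entropy; the gradient
# is simply (predicted - empirical).
logits = np.zeros(3)
for _ in range(5000):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    logits -= 0.5 * (p - target)

# p approaches (0.75, 0.25, 0.0); the probability of the never-occurring
# item shrinks toward zero but remains strictly positive.
```

The residual positive probability for the unseen item is carried entirely by the similarity structure of the representations, which is where the internal encoding discussed above does its work.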
Conclusion
In language processing there are at least two essential aspects to generalization: abstraction and structural extrapolation. Previous work has focused on one or the other.
We have shown that recurrent networks are well-suited to abstract symbols according to their similarities in a structurally homogeneous prediction task. However, the same networks are unable to reach the same level of performance on structural extrapolation as networks trained directly on abstract data. By feeding discretized
Acknowledgments
The author gratefully acknowledges the insightful comments made by two anonymous reviewers.
References (20)
- et al., Toward a connectionist model of recursion in human linguistic performance, Cognitive Sci. (1999)
- et al., Supervised and unsupervised discretization of continuous features
- Finding structure in time, Cognitive Sci. (1990)
- Learning and development in neural networks: the importance of starting small, Cognition (1993)
- et al., Connectionism and cognitive architecture: a critical analysis, Cognition (1988)
- Language acquisition in the absence of explicit negative evidence: can simple recurrent networks obviate the need for domain-specific learning devices?, Cognition (1999)
- et al., Language acquisition in the absence of explicit negative evidence: how important is starting small?, Cognition (1999)
- et al., Simple recurrent networks can distinguish non-occurring from ungrammatical sentences given appropriate task structure: reply to Marcus, Cognition (1999)
- et al., Learning the dynamics of embedded clauses, Appl. Intell. (2003)
- et al., Context-free and context-sensitive dynamics in recurrent neural networks, Connection Sci. (2000)