Learning transform invariant object recognition in the visual system with multiple stimuli present during training
Introduction
An important problem in understanding natural vision is how the brain can build invariant representations of individual objects even when multiple objects are present in a scene. What mechanisms enable learning to proceed without the different objects interfering with the learning of each other's representations? In this paper we describe and analyze an approach to this problem that relies on the statistics of natural environments: during learning, any given object can appear together with any one of a number of other objects or backgrounds, so that each object is statistically somewhat independent of the other objects in the scene.
Over successive stages, the visual system develops neurons that respond with view, size and position (translation) invariance to objects or faces (Desimone, 1991, Perrett and Oram, 1993, Rolls, 1992, Rolls, 2000, Rolls and Deco, 2002, Tanaka et al., 1991). For example, it has been shown that the inferior temporal visual cortex has neurons that respond to faces and objects with translation (Ito et al., 1995, Kobatake and Tanaka, 1994, Op de Beeck and Vogels, 2000, Tovee et al., 1994), size (Ito et al., 1995, Rolls and Baylis, 1986), and view (Booth and Rolls, 1998, Hasselmo et al., 1989) invariance. Such invariant representations, once learned by viewing a number of the transforms of the object, are then useful in the visual system for allowing one-trial learning of, for example, the stimulus–reward association of the object, to generalize to other transforms of the same object (Rolls, 2005, Rolls and Deco, 2002).
A number of computational models have been developed to explain how transform-invariant cells could develop in the visual system (Fukushima, 1980, Riesenhuber and Poggio, 1999, Rolls, 2008, Wallis and Rolls, 1997). Two major theories which have sought to explain how transform-invariant representations could arise through unsupervised training with real-world visual input are trace learning (Földiák, 1991, Rolls and Milward, 2000, Wallis and Rolls, 1997), and Continuous Transformation (CT) learning (Perry et al., 2006, Stringer et al., 2006). Trace learning relies on the temporal continuity of visual objects in the real world (as does slow feature analysis (Wiskott & Sejnowski, 2002)), while in contrast, CT learning relies on spatial continuity.
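The trace rule can be stated compactly: the Hebbian update uses a temporal trace of each output neuron's recent firing rather than its instantaneous rate, so that views seen close together in time (which in the real world tend to be transforms of the same object) become associated onto the same output neuron. A minimal Python sketch, with the parameter names (`eta`, `alpha`) and the particular trace formulation assumed for illustration rather than taken verbatim from the papers cited:

```python
import numpy as np

def trace_update(w, x, y_bar_prev, y, eta=0.8, alpha=0.1):
    """One step of the trace learning rule for a single output neuron.

    w          : weight vector of the neuron
    x          : current pre-synaptic (input) firing vector
    y_bar_prev : trace of the neuron's firing from the previous time step
    y          : the neuron's current firing rate
    eta        : trace parameter (eta = 0 recovers the plain Hebb rule)
    alpha      : learning rate
    """
    # Temporal trace: a decaying average of recent post-synaptic activity
    y_bar = (1.0 - eta) * y + eta * y_bar_prev
    # Hebbian update driven by the trace rather than the instantaneous rate
    w = w + alpha * y_bar * x
    # Weight normalization, as used in competitive networks, bounds growth
    w = w / np.linalg.norm(w)
    return w, y_bar
```

Because the trace carries activity forward in time, successive transforms of the same object reinforce the same weight vector; CT learning, by contrast, needs no trace because spatially overlapping transforms already activate overlapping input units.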
However, in most previous studies of invariance learning in hierarchical networks that model the ventral visual stream, only one stimulus was presented at a time during training (Rolls and Milward, 2000, Rolls and Stringer, 2006, Stringer et al., 2006, Wallis and Rolls, 1997). In this paper we investigate whether, and if so how, models of this type can self-organize during training when multiple stimuli are presented together within each visual image.
In Section 3 we show how a standard 1-layer competitive network responds when trained on input patterns each containing multiple independent stimuli, e.g. pairs. As the number of stimuli increases, the number of possible paired-stimulus input patterns grows quadratically. For small numbers of stimuli, the output neurons still represent the paired-stimulus input patterns. However, once the number of stimuli is large enough, the output neurons instead begin to learn to respond to the individual stimuli rather than to the multi-stimulus input patterns used during training. In this way, we show that a standard competitive network may suddenly switch from representing the multi-stimulus input patterns to representing the individual stimuli as the number of independent stimuli grows large enough.
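The kind of simulation described here can be sketched in a few lines: a winner-take-all competitive network with weight normalization, trained on superpositions of pairs of stimuli. This is an illustrative reconstruction rather than the exact network of Section 3; the block-coded stimulus construction and all parameter values are assumptions:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def train_competitive(patterns, n_out=10, epochs=50, alpha=0.1):
    """Winner-take-all competitive learning with weight normalization."""
    n_in = patterns.shape[1]
    w = rng.random((n_out, n_in))
    w /= np.linalg.norm(w, axis=1, keepdims=True)      # unit-length weights
    for _ in range(epochs):
        for x in rng.permutation(patterns):            # shuffle training order
            winner = np.argmax(w @ x)                  # most activated neuron wins
            w[winner] += alpha * x                     # Hebbian update, winner only
            w[winner] /= np.linalg.norm(w[winner])     # renormalize the winner
    return w

# Each individual stimulus activates its own block of 5 input units;
# each training pattern is the superposition of a pair of stimuli.
n_stimuli, block = 4, 5
stimuli = np.kron(np.eye(n_stimuli), np.ones(block))
pairs = np.array([stimuli[i] + stimuli[j]
                  for i, j in combinations(range(n_stimuli), 2)])
w = train_competitive(pairs)   # 6 paired patterns for 4 stimuli
```

With few stimuli, each output neuron can afford to devote itself to one whole pair; as the number of stimuli (and hence pairs) grows, the recurring single-stimulus components become the statistically dominant structure for the weight vectors to capture.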
In Section 4 we continue to investigate how a 1-layer competitive network may learn to process input patterns containing multiple stimuli. However, we now extend the simulations by allowing the independent stimuli to transform, e.g. translate across the input space, during training. We show that, even when the network is trained on input patterns containing pairs of transforming stimuli, a standard 1-layer competitive network is able to learn invariant representations of the individual stimuli.
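The training regime with transforming stimuli can be illustrated by generating input patterns in which two independently paired stimuli each sweep across the input space. A hypothetical sketch (the wrap-around translation and the opposite sweep directions are illustrative assumptions, not details taken from the simulations):

```python
import numpy as np

def translated(stimulus, shift):
    """Translate a 1-D stimulus across the input space (with wrap-around)."""
    return np.roll(stimulus, shift)

def paired_transform_sequence(stimuli, i, j, n_positions):
    """Training sequence in which stimuli i and j each sweep across the input.

    The opposite sweep directions are an illustrative choice: what matters
    is that the two stimuli transform independently of one another.
    """
    return np.array([translated(stimuli[i], s) + translated(stimuli[j], -s)
                     for s in range(n_positions)])
```

Feeding such sequences (over many different pairings) to the competitive network above lets the spatial overlap between successive translations of each individual stimulus drive CT-style invariance learning, even though every input frame contains two stimuli.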
In Section 5 we extend these results to a full 4-layer hierarchical feedforward network model (VisNet) of the ventral visual processing stream. The visual input stimuli are rotating 3D objects created using the OpenGL 3D image generation software. We train the network on input images containing multiple rotating objects. We demonstrate that the network is able to develop view-invariant representations of the individual objects.
Section snippets
A hypothesis on how learning can occur about single objects even when multiple objects are present
We consider how learning about individual objects can occur even when a number of objects are present. Consider a situation that might occur in the real world in which an individual object is present, but is accompanied by one other object drawn from a set of other objects. A possible series of trials might present all possible pairs of objects (where each trial contains one such pair). For the case of 4 stimuli, there are a total of 6 paired-stimulus training patterns used on different
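The combinatorics behind this hypothesis are simple to verify: the number of distinct pairs of n stimuli is n(n-1)/2, so the pair count grows quadratically while each individual stimulus appears in only n-1 of the pairs. A quick check:

```python
from itertools import combinations

def pair_patterns(n):
    """All paired-stimulus training patterns for n individual stimuli."""
    return list(combinations(range(n), 2))

# The number of training pairs grows quadratically, n * (n - 1) / 2 ...
print([len(pair_patterns(n)) for n in (4, 8, 16)])   # [6, 28, 120]
# ... while each individual stimulus appears in only n - 1 of them, so any
# particular pair of stimuli co-occurs in a shrinking fraction of trials.
```

It is this shrinking co-occurrence probability that makes the individual stimulus, rather than any particular pairing, the reliable statistical regularity in the training set.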
Training a 1-layer competitive network with input patterns which contain multiple independent stimuli
In this section we show how the mechanism described in Section 2 operates in a 1-layer competitive network.
Learning transform-invariant representations of multiple stimuli in a 1-layer competitive network
It is important for the visual system to be able to produce transform-invariant representations of objects, so that the same output representation can be activated by, for example, the different locations or views of an object (Riesenhuber and Poggio, 1999, Rolls and Deco, 2002). We now hypothesize that invariance learning could be combined with the mechanism for learning about individual objects described in Section 2, A hypothesis on how learning can occur about single objects even when
Training a multi-layer feedforward network (VisNet) with multiple 3D rotating stimuli
We now show how the learning mechanisms described above can operate in a more biologically realistic model of transform (e.g. view) invariant object recognition in the ventral visual processing stream, VisNet (Wallis & Rolls, 1997). To do this, we trained the network on a view invariance problem in which each stimulus was a 3D object shown in 90 views, and the aim was to test whether the network could form view-invariant representations of individual objects. What was different from any previous training
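One simple way to quantify the view invariance of an output neuron (a rough stand-in for the single-cell information measures typically used with VisNet, which are not reproduced here) is to ask how much of the variance in its responses lies between objects rather than between views of the same object. A hedged sketch:

```python
import numpy as np

def invariance_index(responses):
    """responses: array of shape (n_objects, n_views) for one output neuron.

    Returns the fraction of the neuron's response variance that lies
    between objects rather than between views: near 1 for a neuron that
    responds to one object across all of its views (view invariance),
    near 0 for a neuron tuned to particular views regardless of object.
    """
    grand = responses.mean()
    between = ((responses.mean(axis=1) - grand) ** 2).mean()
    total = ((responses - grand) ** 2).mean()
    return between / total if total > 0 else 0.0
```

Applied to a neuron's responses over all 90 views of each object, an index near 1 indicates that object identity, not view, determines the response.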
Discussion
In natural vision an important problem is how the brain can build invariant representations of individual objects even when multiple objects are present in a scene. What processes enable the learning to proceed for individual objects rather than for the combinations of objects that may be present during training? In this paper we have described and analyzed an approach to this that relies on the statistics of natural environments. In particular, the features of a given object tend to co-occur
Acknowledgements
This research was supported by the Wellcome Trust, and by the MRC Interdisciplinary Research Centre for Cognitive Neuroscience.
References
- et al. (1993). Neurophysiology of shape processing. Image and Vision Computing.
- et al. (2006). Spatial vs temporal continuity in view invariant visual object recognition learning. Vision Research.
- (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron.
- et al. (2006). Invariant visual object recognition: A model, with lighting invariance. Journal of Physiology–Paris.
- et al. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Networks.
- et al. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology.
- et al. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex.
- et al. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annual Review of Neuroscience.
- (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience.
- et al. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics.
- Learning invariance from transformation sequences. Neural Computation.
- Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics.
- Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research.
- Introduction to the theory of neural computation.
- The wake-sleep algorithm for unsupervised neural networks. Science.
- Size and position invariance of neuronal response in monkey inferotemporal cortex. Journal of Neurophysiology.
- Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. Journal of Neurophysiology.
- Associative changes in the synapse: LTP in the hippocampus.
- The rules of elemental synaptic plasticity.
- Minimizing binding errors using learned conjunctive features. Neural Computation.