Neural Networks

Volume 21, Issue 7, September 2008, Pages 888-903

Learning transform invariant object recognition in the visual system with multiple stimuli present during training

https://doi.org/10.1016/j.neunet.2007.11.004

Abstract

Over successive stages, the visual system develops neurons that respond with view, size and position invariance to objects or faces. A number of computational models have been developed to explain how transform-invariant cells could develop in the visual system. However, a major limitation of computer modelling studies to date has been that the visual stimuli are typically presented one at a time to the network during training. In this paper, we investigate how vision models may self-organize when multiple stimuli are presented together within each visual image during training. We show that as the number of independent stimuli grows large enough, standard competitive neural networks can suddenly switch from learning representations of the multi-stimulus input patterns to representing the individual stimuli. Furthermore, the competitive networks can learn transform (e.g. position or view) invariant representations of the individual stimuli if the network is presented with input patterns containing multiple transforming stimuli during training. Finally, we extend these results to a multi-layer hierarchical network model (VisNet) of the ventral visual system. The network is trained on input images containing multiple rotating 3D objects. We show that the network is able to develop view-invariant representations of the individual objects.

Introduction

An important problem in understanding natural vision is how the brain can build invariant representations of individual objects even when multiple objects are present in a scene. What mechanisms enable learning to proceed without the different objects interfering with the learning of each individual object representation? In this paper we describe and analyze an approach that relies on the statistics of natural environments: during learning, any given object can appear with any one of a number of other objects or backgrounds, so that each object is statistically somewhat independent of the other objects in the scene.

Over successive stages, the visual system develops neurons that respond with view, size and position (translation) invariance to objects or faces (Desimone, 1991, Perrett and Oram, 1993, Rolls, 1992, Rolls, 2000, Rolls and Deco, 2002, Tanaka et al., 1991). For example, it has been shown that the inferior temporal visual cortex has neurons that respond to faces and objects with translation (Ito et al., 1995, Kobatake and Tanaka, 1994, Op de Beeck and Vogels, 2000, Tovee et al., 1994), size (Ito et al., 1995, Rolls and Baylis, 1986), and view (Booth and Rolls, 1998, Hasselmo et al., 1989) invariance. Such invariant representations, once learned by viewing a number of the transforms of the object, are then useful in the visual system for allowing one-trial learning of, for example, the stimulus–reward association of the object, to generalize to other transforms of the same object (Rolls, 2005, Rolls and Deco, 2002).

A number of computational models have been developed to explain how transform-invariant cells could develop in the visual system (Fukushima, 1980, Riesenhuber and Poggio, 1999, Rolls, 2008, Wallis and Rolls, 1997). Two major theories which have sought to explain how transform-invariant representations could arise through unsupervised training with real-world visual input are trace learning (Földiák, 1991, Rolls and Milward, 2000, Wallis and Rolls, 1997), and Continuous Transformation (CT) learning (Perry et al., 2006, Stringer et al., 2006). Trace learning relies on the temporal continuity of visual objects in the real world (as does slow feature analysis (Wiskott & Sejnowski, 2002)), while in contrast, CT learning relies on spatial continuity.
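The trace rule mentioned above (Földiák, 1991; Rolls & Milward, 2000) can be sketched as a single weight update. The sketch below is a minimal illustration, not the paper's implementation; the function name and the parameter values eta and alpha are our own choices.

```python
import numpy as np

def trace_learning_step(w, x, y, y_trace_prev, eta=0.8, alpha=0.1):
    """One update of a trace learning rule.

    The postsynaptic trace blends the current firing rate y with its
    value on the previous time step, so weights strengthen onto inputs
    that occur close together in time -- e.g. successive transforms of
    the same object sweeping across the retina.
    """
    y_trace = (1.0 - eta) * y + eta * y_trace_prev  # temporal trace of firing
    w = w + alpha * y_trace * x                     # Hebbian update gated by the trace
    return w / np.linalg.norm(w), y_trace           # keep the weight vector unit length
```

Because the trace carries activity across time steps, inputs presented in temporal succession are bound onto the same output neuron, which is what allows temporally contiguous transforms of an object to acquire a common representation.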

However, in most previous studies of invariance learning in hierarchical networks that model the ventral visual stream, only one stimulus is presented at a time during training (Rolls and Milward, 2000, Rolls and Stringer, 2006, Stringer et al., 2006, Wallis and Rolls, 1997). In this paper we investigate whether, and if so how, models of this type can self-organize during training when multiple stimuli are presented together within each visual image.

In Section 3 we show how a standard 1-layer competitive network responds when trained on input patterns containing multiple (e.g. pairs of) independent stimuli. As the number of stimuli N increases, the number of possible input patterns composed of pairs of stimuli, N(N−1)/2, grows quadratically. For small N, the output neurons represent the paired-stimulus input patterns. However, for large enough N, the output neurons begin to respond to the individual stimuli instead of the multi-stimulus input patterns used during training. In this way, we show that a standard competitive network may suddenly switch from learning representations of the multi-stimulus input patterns to representing the individual stimuli as the number of independent stimuli grows large enough.
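The training regime described above can be explored with a toy model. The sketch below is a minimal winner-take-all competitive network trained on superimposed pairs of localist stimuli; the layer sizes, learning rate, and number of epochs are illustrative choices of ours, not values from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

N = 8        # number of individual stimuli
M = 16       # number of output neurons
alpha = 0.1  # learning rate

# Each training pattern superimposes a pair of stimuli; with localist
# coding (one input line per stimulus) there are N(N - 1)/2 = 28 patterns.
patterns = np.zeros((N * (N - 1) // 2, N))
for k, (i, j) in enumerate(combinations(range(N), 2)):
    patterns[k, [i, j]] = 1.0

# Random feedforward weights, kept at unit length as in a standard
# competitive network.
W = rng.random((M, N))
W /= np.linalg.norm(W, axis=1, keepdims=True)

for epoch in range(50):
    rng.shuffle(patterns)                      # random trial order
    for x in patterns:
        winner = np.argmax(W @ x)              # winner-take-all competition
        W[winner] += alpha * x                 # Hebbian update for the winner
        W[winner] /= np.linalg.norm(W[winner]) # renormalize the winner's weights
```

After training, inspecting the rows of W shows whether individual output neurons have come to be dominated by a single input line or by a pair, which is the distinction at issue in Section 3.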

In Section 4 we continue to investigate how a 1-layer competitive network may learn to process input patterns containing multiple stimuli. However, we now extend the simulations by allowing the independent stimuli to transform, e.g. translate across the input space, during training. We show that, even when the network is trained on input patterns containing pairs of transforming stimuli, a standard 1-layer competitive network is able to learn invariant representations of the individual stimuli.

In Section 5 we extend these results to a full 4-layer hierarchical feedforward network model (VisNet) of the ventral visual processing stream. The visual input stimuli are rotating 3D objects created using the OpenGL 3D image generation software. We train the network on input images containing multiple rotating objects. We demonstrate that the network is able to develop view-invariant representations of the individual objects.
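The multi-layer model can be caricatured as a stack of competitive stages. The sketch below illustrates only the feedforward-with-competition structure; the layer sizes and sparseness fraction are arbitrary choices of ours, and VisNet proper uses topological layers with local receptive fields over a Gabor-filtered retina, together with trace learning (Wallis & Rolls, 1997).

```python
import numpy as np

rng = np.random.default_rng(1)

def competitive_stage(x, W, frac=0.05):
    """One feedforward stage with soft competition: keep only the most
    strongly activated fraction of neurons (a crude stand-in for
    lateral inhibition)."""
    y = W @ x
    k = max(1, int(frac * y.size))
    threshold = np.sort(y)[-k]          # activation of the k-th strongest neuron
    return np.where(y >= threshold, y, 0.0)

# Illustrative sizes only: an input "retina" followed by four layers.
sizes = [256, 128, 128, 128, 128]
weights = [rng.random((sizes[i + 1], sizes[i])) for i in range(4)]
for W in weights:
    W /= np.linalg.norm(W, axis=1, keepdims=True)

activity = rng.random(sizes[0])         # stand-in for a filtered input image
for W in weights:                       # propagate through the 4-layer hierarchy
    activity = competitive_stage(activity, W)
```

Stacking competitive stages in this way is what allows neurons in the later layers to build responses over progressively larger regions of the input, which is the architectural basis for the invariance results reported in Section 5.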

Section snippets

A hypothesis on how learning can occur about single objects even when multiple objects are present

We consider how learning about individual objects can occur even when a number of objects are present. Consider a situation that might occur in the real world in which an individual object is present, but is accompanied by one other object from a set of other objects. A possible series of trials might have (where each number refers to an individual object) all possible pairs of objects. For the case of N=4 stimuli, there are a total of 6 paired-stimulus training patterns used on different training trials.
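The enumeration of training pairs can be written down directly; a quick check, with stimulus labels of our own choosing:

```python
from itertools import combinations

N = 4
# All unordered pairs of N = 4 stimuli: N(N - 1)/2 = 6 training patterns.
pairs = list(combinations(range(1, N + 1), 2))
print(pairs)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```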

Training a 1-layer competitive network with input patterns which contain multiple independent stimuli

In this section we show how the mechanism described in Section 2 operates in a 1-layer competitive network.

Learning transform-invariant representations of multiple stimuli in a 1-layer competitive network

It is important for the visual system to be able to produce transform-invariant representations of objects, so that the same output representation can be activated by, for example, the different locations or views of an object (Riesenhuber and Poggio, 1999, Rolls and Deco, 2002). We now hypothesize that invariance learning could be combined with the mechanism for learning about individual objects described in Section 2.

Training a multi-layer feedforward network (VisNet) with multiple 3D rotating stimuli

We now show how the learning mechanisms described above can operate in a more biologically realistic model of transform (e.g. view) invariant object recognition in the ventral visual processing stream, VisNet (Wallis & Rolls, 1997). To do this, we trained the network on a view-invariance problem, in which each stimulus was a 3D object shown in 90 views, and the aim was to test whether the network could form view-invariant representations of individual objects. What was different from any previous training…

Discussion

In natural vision an important problem is how the brain can build invariant representations of individual objects even when multiple objects are present in a scene. What processes enable the learning to proceed for individual objects rather than for the combinations of objects that may be present during training? In this paper we have described and analyzed an approach to this that relies on the statistics of natural environments. In particular, the features of a given object tend to co-occur…

Acknowledgements

This research was supported by the Wellcome Trust, and by the MRC Interdisciplinary Research Centre for Cognitive Neuroscience.

References (41)

  • P. Földiák, Learning invariance from transformation sequences, Neural Computation (1991)
  • K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics (1980)
  • M.E. Hasselmo et al., Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey, Experimental Brain Research (1989)
  • J. Hertz et al., Introduction to the theory of neural computation (1991)
  • G.E. Hinton et al., The wake-sleep algorithm for unsupervised neural networks, Science (1995)
  • M. Ito et al., Size and position invariance of neuronal response in monkey inferotemporal cortex, Journal of Neurophysiology (1995)
  • E. Kobatake et al., Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex, Journal of Neurophysiology (1994)
  • W.B. Levy, Associative changes in the synapse: LTP in the hippocampus
  • W.B. Levy et al., The rules of elemental synaptic plasticity
  • B.W. Mel et al., Minimizing binding errors using learned conjunctive features, Neural Computation (2000)