
Neural Networks

Volume 24, Issue 10, December 2011, Pages 1050-1061

How does the brain rapidly learn and reorganize view-invariant and position-invariant object representations in the inferotemporal cortex?

https://doi.org/10.1016/j.neunet.2011.04.004

Abstract

All primates depend for their survival on being able to rapidly learn about and recognize objects. Objects may be visually detected at multiple positions, sizes, and viewpoints. How does the brain rapidly learn and recognize objects while scanning a scene with eye movements, without causing a combinatorial explosion in the number of cells that are needed? How does the brain avoid the problem of erroneously classifying parts of different objects together at the same or different positions in a visual scene? In monkeys and humans, a key area for such invariant object category learning and recognition is the inferotemporal cortex (IT). A neural model is proposed to explain how spatial and object attention coordinate the ability of IT to learn invariant category representations of objects that are seen at multiple positions, sizes, and viewpoints. The model clarifies how interactions within a hierarchy of processing stages in the visual brain accomplish this. These stages include the retina, lateral geniculate nucleus, and cortical areas V1, V2, V4, and IT in the brain’s What cortical stream, as they interact with spatial attention processes within the parietal cortex of the Where cortical stream. The model builds upon the ARTSCAN model, which proposed how view-invariant object representations are generated. The positional ARTSCAN (pARTSCAN) model proposes how the following additional processes in the What cortical processing stream also enable position-invariant object representations to be learned: IT cells with persistent activity, and a combination of normalizing object category competition and a view-to-object learning law which together ensure that unambiguous views have a larger effect on object recognition than ambiguous views. The model explains how such invariant learning can be fooled when monkeys, or other primates, are presented with an object that is swapped with another object during eye movements to foveate the original object. 
The swapping procedure is predicted to prevent the reset of spatial attention, which would otherwise keep the representations of multiple objects from being combined by learning. Li and DiCarlo (2008) have presented neurophysiological data from monkeys showing how unsupervised natural experience in a target swapping experiment can rapidly alter object representations in IT. The model quantitatively simulates the swapping data by showing how the swapping procedure fools the spatial attention mechanism. More generally, the model provides a unifying framework, and testable predictions in both monkeys and humans, for understanding object learning data using neurophysiological methods in monkeys, and spatial attention, episodic learning, and memory retrieval data using functional imaging methods in humans.

Introduction

The brain effortlessly learns to recognize objects that are seen at multiple positions, sizes, and viewpoints. How does the brain rapidly learn to recognize objects while scanning a scene with eye movements, without causing a combinatorial explosion in the number of cells that are needed? How does the brain avoid the problem of erroneously classifying parts of different objects together? In monkeys and humans, a key area for such invariant object learning and recognition is the inferotemporal cortex (IT). A neural model is proposed to explain how spatial and object attention coordinate the ability of IT to learn representations of object categories that are seen at multiple positions, sizes, and viewpoints. Such invariant object category learning and recognition can be achieved using interactions between a hierarchy of processing stages in the visual brain. These stages include the retina, lateral geniculate nucleus, and cortical areas V1, V2, V4, and IT in the brain’s What cortical stream, as they interact with spatial attention processes within the parietal cortex of the Where cortical stream. The model builds upon the ARTSCAN model (Fazl et al., 2009, Grossberg, 2009), which proposed how view-invariant object representations may be learned and recognized.

A key prediction of the ARTSCAN model is how the reset of spatial attention in the Where cortical stream prevents views of different objects from being learned as part of the same invariant IT category. The positional ARTSCAN (pARTSCAN) model that is developed in the current article proposes how the following additional processes in the What cortical processing stream also enable position-invariant object representations to be learned: IT cells with persistent activity, and a combination of normalizing object category competition and a view-to-object learning law which together ensure that unambiguous views have a larger effect on object recognition than ambiguous views. The model is tested by simulating neurophysiological data from a target swapping experiment of Li and DiCarlo (2008) that is predicted to fool the spatial attentional reset mechanisms which usually keep different object views separated during learning.

Many electrophysiological experiments have shown that cells in the inferotemporal (IT) cortex respond to the same object at different retinal positions; for example, many IT cells show little attenuation in firing rate across object translations (Booth and Rolls, 1998, Desimone and Gross, 1979, Gross et al., 1972, Ito et al., 1995, Schwartz et al., 1983). The target swapping experiment of Li and DiCarlo (2008) showed, in addition, how the positional selectivity of cells in IT can be altered by experience. Their experiment was divided into two exposure phases, in which two extra-foveal positions (3° above or below the center of gaze) were prechosen as swap and non-swap positions. The experiment studied IT neuronal responses to two objects that initially elicited strong (object P, preferred) and moderate (object N, non-preferred) responses at the two positions. The monkey always began a learning trial looking at a fixation point. During a “normal exposure”, when an object appeared at the prechosen non-swap position, the monkey quickly moved its eyes to it with a saccadic eye movement that brought its image to the fovea. During a “swap exposure”, in which an object appeared at the prechosen swap position, the object P (or N) was always swapped for the other object N (or P) during the saccade. Li and DiCarlo found that IT neuron selectivity to objects P and N at the swap position was reversed with increasing exposure (see Fig. 1(A)), but there was little or no change at the non-swap position.
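The exposure logic of the swapping experiment can be sketched as a toy trial generator. This is a minimal illustration under assumed names and position offsets, not the experimenters' code:

```python
# Hypothetical sketch of the Li & DiCarlo (2008) exposure schedule described
# above. Position offsets, object labels, and trial structure are assumptions
# for illustration only.
import random

SWAP_POS, NONSWAP_POS = +3, -3   # degrees above/below fixation (assumed mapping)

def exposure_trial(rng):
    """Return (peripheral_object, foveated_object, position) for one trial."""
    pos = rng.choice([SWAP_POS, NONSWAP_POS])
    obj = rng.choice(["P", "N"])          # preferred / non-preferred object
    if pos == SWAP_POS:
        # Swap exposure: during the saccade, the object is always exchanged
        # for the other one, so a different object lands on the fovea.
        foveated = "N" if obj == "P" else "P"
    else:
        # Normal exposure: the same object is foveated after the saccade.
        foveated = obj
    return obj, foveated, pos

rng = random.Random(0)
trials = [exposure_trial(rng) for _ in range(1000)]
# Peripheral and foveated identities always differ at the swap position only.
assert all(a != b for a, b, p in trials if p == SWAP_POS)
assert all(a == b for a, b, p in trials if p == NONSWAP_POS)
```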

The pARTSCAN model (Fig. 2) quantitatively explains and simulates the Li and DiCarlo data as a manifestation of the mechanisms whereby the brain learns position-invariant object representations. Some prominent efforts to model IT have built invariant representations using a hierarchy of feedforward filters leading to a learned category choice (Bradski and Grossberg, 1995, Grossberg and Huang, 2009, Riesenhuber and Poggio, 1999, Riesenhuber and Poggio, 2000, Riesenhuber and Poggio, 2002), or by grouping object translations through time (Fazl et al., 2009, Wallis and Rolls, 1997). The pARTSCAN model proposes how the brain learns position-invariant object representations that are consistent with the Li and DiCarlo swapping data. In particular, the pARTSCAN model, as in the ARTSCAN model on which it builds, proposes how multiple brain processing stages, beginning in the retina and lateral geniculate nucleus (LGN), and proceeding through cortical areas V1, V2, V4, and IT in the What cortical stream, can gradually learn such position-invariant object representations, as they interact with Where cortical stream processing stages in the parietal cortex.

The ARTSCAN model proposes how an object’s surface representation in cortical area V4 generates a form-fitting distribution of spatial attention, or “attentional shroud”, in the parietal cortex of the Where cortical stream. All surface representations dynamically compete for spatial attention to form a shroud. The winning shroud (or shrouds; see Foley, Grossberg, and Mingolla (submitted for publication) for simulations of multifocal attention) remains active due to a surface-shroud resonance that persists during active scanning of the object with eye movements. The active shroud regulates eye movements and category learning about the attended object in the following way.

The first view-specific category to be learned for the attended object also activates a cell population at a higher processing stage. This cell population will become a view-invariant object category. Both types of category are assumed to form in the IT cortex of the What cortical stream. As the eyes explore different views of the object, previously active view-specific categories are reset to enable new view-specific categories to be learned. What prevents the emerging view-invariant object category from also being reset? The shroud maintains the activity of the emerging view-invariant category representation by inhibiting a reset mechanism, also predicted to be in the parietal cortex, that would otherwise inhibit the view-invariant category. As a result, all the view-specific categories can be linked through associative learning to the emerging view-invariant object category. Indeed, these associative linkages create the view invariance property.

Shroud collapse disinhibits the reset signal, which in turn inhibits the active view-invariant category. Then a new shroud, corresponding to a different object, forms in the Where cortical stream as new view-specific and view-invariant categories of the new object are learned in the What cortical stream. The model hereby mechanistically clarifies basic properties of spatial attention shifts (engage, move, disengage) and inhibition of return. As noted in Section 4, the concepts of shroud persistence and reset clarify traditional ideas about sustained and transient attention, respectively.
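As a toy illustration of this gating logic (discrete events standing in for the model's continuous dynamics, which are given in Section 5; all names here are hypothetical):

```python
# Illustrative sketch, not the authors' equations: while a shroud is active,
# successive views of the attended object are linked to a single emerging
# view-invariant category; shroud collapse triggers reset, so the next
# object gets a fresh category rather than merging with the previous one.
def scan_object(views, invariant_links):
    """Link each attended view to one emerging view-invariant category."""
    invariant_category = object()        # emerging invariant representation
    for view in views:
        # Each saccade resets the previous view-specific category, but the
        # active shroud inhibits the parietal reset signal, so the invariant
        # category survives and the new view is associated with it.
        invariant_links.setdefault(invariant_category, []).append(view)
    # Shroud collapse disinhibits reset: the invariant category is inhibited,
    # and a new shroud/category cycle begins for the next object.
    return invariant_category

links = {}
cat1 = scan_object(["front", "side", "top"], links)   # first attended object
cat2 = scan_object(["handle", "rim"], links)          # after shroud reset
assert links[cat1] == ["front", "side", "top"]
assert cat1 is not cat2   # reset keeps two objects' views in separate categories
```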

The ARTSCAN model does not, however, explain how position-invariant object categories are learned and recognized. The current article proposes what additional brain mechanisms are needed to learn position-invariant object categories. These new mechanisms include a new functional role for cells with persistent activity in IT (see Brunel, 2003, Fuster and Jervey, 1981, Miyashita and Chang, 1988, Tomita et al., 1999) and a competitive learning law whereby more predictive unambiguous object views learn to have a larger effect on object recognition than less predictive ambiguous views.
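The intended effect of such a competitive learning law can be sketched as follows; the update rule and all names below are simplified illustrative assumptions (the article's actual learning law appears in Section 5). A view that is always paired with one object category acquires a large weight to it, while a view shared between two categories splits its association and each weight stays small:

```python
# Hedged sketch of normalized, competitive view-to-object learning: object
# categories compete for each view's association, so unambiguous views end
# up with larger weights (a stronger effect on recognition) than ambiguous
# views. This is an illustration, not the model's learning law.
def learn(pairings, rate=0.1, steps=200):
    """pairings: list of (view, object) co-activations, presented cyclically."""
    w = {}
    objects = {obj for _, obj in pairings}
    for t in range(steps):
        view, obj = pairings[t % len(pairings)]
        # Push the active view's weight toward 1 for the co-active category
        # and toward 0 for its competitors (normalizing competition).
        for o in objects:
            target = 1.0 if o == obj else 0.0
            w[(view, o)] = w.get((view, o), 0.0) + rate * (target - w.get((view, o), 0.0))
    return w

w = learn([("unambiguous", "A"), ("ambiguous", "A"), ("ambiguous", "B")])
# The unambiguous view dominates its category; the shared view's weights
# hover near an intermediate value for both categories.
assert w[("unambiguous", "A")] > w[("ambiguous", "A")]
assert w[("unambiguous", "A")] > w[("ambiguous", "B")]
```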

The pARTSCAN model quantitatively simulates the swapping data by showing how the swapping procedure fools the spatial attentional shroud mechanism that usually is reset when a new object is presented, thereby preventing multiple objects from learning to activate the same invariant object category. The model predicts that the shroud of the previous object is not reset during the swap with another object. Persistence of this attentional shroud across swaps leads to rapid reshaping of IT receptive fields through unsupervised natural visual experience when it interacts with IT persistent activity and competitive learning. In addition to these predictions, which can be tested in monkeys, a prediction is made in Section 4 about how to test the shroud hypothesis during a swapping experiment using fMRI in humans. The same combination of brain mechanisms can also explain how swapping targets of different sizes can lead to rapid learning of the corresponding mixtures of object views at different sizes (Li & DiCarlo, 2010).

Section snippets

Model processing stages

The model consists of the following processing stages. See Fig. 2. These stages are described heuristically in this section and mathematically in Section 5.

Contrast normalization and discounting the illuminant. The contrasts in each input image are normalized, and background illumination is discounted, in the simplified model retina and LGN by an on-center off-surround network whose cells obey membrane, or shunting, equations (Grossberg and Todorovic, 1988, Werblin, 1971). This network defines

Model simulations

The swapping simulations used the two objects, boat (“P”) and bowl (“N”), as stimuli that were used in the Li and DiCarlo experiment. In the simulated retina, an above-foveal position was predefined as the swap position and a below-foveal position was predefined as the non-swap position. In order to simulate normal daily experience, the two objects were first learned at three positions (the predefined swap position, non-swap position, and fovea) by 10,000 normal exposures. This experience led

Discussion

Target swapping fools the reset mechanism. The pARTSCAN model proposes how the brain can learn object categories that are invariant across object positions, sizes, and views. Key model mechanisms that enable position-invariant object learning enable the model to quantitatively simulate the Li and DiCarlo (2008) swapping data. In effect, the Li and DiCarlo (2008) experiments bypass the mechanism whereby attentional shrouds normally get reset when one object is replaced by another one. Their

Model equations

Retina/LGN: discounting the illuminant and contrast normalization. The luminance of the retinal input image $I_{pq}$ at position $(p,q)$ is preprocessed by the model retina/LGN to discount the illuminant and contrast-normalize the image using shunting on-center off-surround networks (Grossberg & Todorovic, 1988). The equilibrium output signals $X_{ij}^{+}$ and $X_{ij}^{-}$ of ON and OFF cells, respectively, at position $(i,j)$ are defined by
$$X_{ij}^{+} = \left[X_{ij} - 0.05\right]^{+}, \qquad X_{ij}^{-} = \left[-X_{ij} - 0.05\right]^{+},$$
where
$$X_{ij} = \frac{4\,(C_{ij} - S_{ij})}{105 + C_{ij} + S_{ij}},$$
notation $[\,\cdot\,]^{+} = \max(\cdot, 0)$ denotes half-wave rectification, and $C_{ij}$ and $S_{ij}$ are the on-center and off-surround inputs, respectively.
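The ON/OFF normalization above can be sketched in NumPy. The box-blur kernels, blur radii, and image are illustrative assumptions standing in for the model's actual center and surround kernels (given in Section 5); only the shunting equilibrium and rectification follow the equations stated here:

```python
import numpy as np

def box_blur(a, r):
    """Box filter of radius r: a crude stand-in for center/surround kernels."""
    p = np.pad(a, r, mode="edge")
    out = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy : dy + a.shape[0], dx : dx + a.shape[1]]
    return out / (2 * r + 1) ** 2

def retina_lgn(image):
    """Contrast-normalize an image into ON and OFF channels at equilibrium."""
    C = box_blur(image, 1)                  # on-center input C_ij (assumed radius)
    S = box_blur(image, 4)                  # off-surround input S_ij (assumed radius)
    X = 4.0 * (C - S) / (105.0 + C + S)     # shunting on-center off-surround
    x_on = np.maximum(X - 0.05, 0.0)        # ON cells:  [X - 0.05]^+
    x_off = np.maximum(-X - 0.05, 0.0)      # OFF cells: [-X - 0.05]^+
    return x_on, x_off

img = np.zeros((32, 32))
img[8:24, 8:24] = 255.0                     # bright square on a dark field
on, off = retina_lgn(img)
# ON cells respond just inside the bright edge, OFF cells just outside it;
# the uniform square interior is suppressed by the divisive normalization.
```

Note the divisive term $105 + C_{ij} + S_{ij}$: it makes the output depend on local contrast rather than absolute luminance, which is what "discounting the illuminant" means here.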

Acknowledgments

This work was supported in part by CELEST, a National Science Foundation Science of Learning Center (SBE-0354379), and by the SyNAPSE program of DARPA (HR0011-09-C-0001).

References (87)

  • S. Grossberg et al. (2008). Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Research.
  • S. Grossberg et al. (2005). Laminar cortical dynamics of 3D surface perception: stratification, transparency, and neon color spreading. Vision Research.
  • N. Li et al. (2010). Unsupervised natural visual experience rapidly reshapes size-invariant object representation in inferior temporal cortex. Neuron.
  • N.K. Logothetis et al. (1994). View-dependent object recognition by monkeys. Current Biology.
  • J.H. Reynolds et al. (2009). The normalization model of attention. Neuron.
  • M. Riesenhuber et al. (2002). Neural mechanisms of object recognition. Current Opinion in Neurobiology.
  • D.C. Rogers-Ramachandran et al. (1998). Psychophysical evidence for boundary and surface systems in human vision. Vision Research.
  • E.L. Schwartz (1980). Computational anatomy and functional architecture of striate cortex: a spatial mapping approach to perceptual coding. Vision Research.
  • T. Serre et al. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research.
  • D.C. Van Essen et al. (1984). The visual field representation in striate cortex of the macaque monkey: asymmetries, anisotropies, and individual variability. Vision Research.
  • G. Wallis et al. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology.
  • X.-J. Wang (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences.
  • G.A. Alvarez et al. (2008). Visual short-term memory operates more efficiently on boundary features than on surface features. Perception & Psychophysics.
  • M.C. Booth et al. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex.
  • N. Brunel (2003). Dynamics and plasticity of stimulus selective persistent activity in cortical network models. Cerebral Cortex.
  • H.H. Bulthoff et al. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences of the United States of America.
  • H.H. Bulthoff et al. (1995). How are three-dimensional objects represented in the brain? Cerebral Cortex.
  • R. Cabeza et al. (2008). The parietal cortex and episodic memory: an attentional account. Nature Reviews Neuroscience.
  • Y. Cao et al. (2005). A laminar cortical model of stereopsis and 3D surface perception: closure and da Vinci stereopsis. Spatial Vision.
  • G.A. Carpenter et al. (1991). Pattern recognition by self-organizing neural networks.
  • G.A. Carpenter et al. (1992). Fuzzy ARTMAP—a neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks.
  • Carpenter, G. A., & Ross, W. D. (1993). ART-EMAP: a neural network architecture for learning and prediction by evidence...
  • H.-C. Chang et al. (2009). Where's Waldo? How the brain learns to categorize and discover desired objects in a cluttered scene [Abstract]. Journal of Vision.
  • Y.C. Chiu et al. (2009). A domain-independent source of cognitive control for task sets: shifting spatial attention and switching categorization rules. Journal of Neuroscience.
  • M. Corbetta et al. (2000). Voluntary orienting is dissociated from target detection in human posterior parietal cortex. Nature Neuroscience.
  • P.M. Daniel et al. (1961). The representation of the visual field on the cerebral cortex in monkeys. The Journal of Physiology.
  • J. Davidoff (1991). Cognition through color.
  • R. Desimone (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society, Series B (Biological Sciences).
  • D. Durstewitz et al. (2000). Neurocomputational models of working memory. Nature Neuroscience.
  • A.K. Engel et al. (2001). Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience.
  • L. Fang et al. (2009). From stereogram to surface: how the brain sees the world in depth. Spatial Vision.
  • H. Fischer (1973). Overlap of receptive field centers and representation of the visual field in the cat's optic tract. Vision Research.
  • Foley, N. C., Grossberg, S., & Mingolla, E. (2011). Neural dynamics of object-based multifocal visual spatial attention...