How does the brain rapidly learn and reorganize view-invariant and position-invariant object representations in the inferotemporal cortex?
Introduction
The brain effortlessly learns to recognize objects that are seen at multiple positions, sizes, and viewpoints. How does the brain rapidly learn to recognize objects while scanning a scene with eye movements, without causing a combinatorial explosion in the number of cells that are needed? How does the brain avoid the problem of erroneously classifying parts of different objects together? In monkeys and humans, a key area for such invariant object learning and recognition is the inferotemporal cortex (IT). A neural model is proposed to explain how spatial and object attention coordinate the ability of IT to learn representations of object categories that are seen at multiple positions, sizes, and viewpoints. Such invariant object category learning and recognition can be achieved using interactions between a hierarchy of processing stages in the visual brain. These stages include the retina, lateral geniculate nucleus, and cortical areas V1, V2, V4, and IT in the brain’s What cortical stream, as they interact with spatial attention processes within the parietal cortex of the Where cortical stream. The model builds upon the ARTSCAN model (Fazl et al., 2009, Grossberg, 2009), which proposed how view-invariant object representations may be learned and recognized.
A key prediction of the ARTSCAN model is how the reset of spatial attention in the Where cortical stream prevents views of different objects from being learned as part of the same invariant IT category. The positional ARTSCAN (pARTSCAN) model that is developed in the current article proposes how the following additional processes in the What cortical processing stream also enable position-invariant object representations to be learned: IT cells with persistent activity, and a combination of normalizing object category competition and a view-to-object learning law which together ensure that unambiguous views have a larger effect on object recognition than ambiguous views. The model is tested by simulating neurophysiological data from a target swapping experiment of Li and DiCarlo (2008) that is predicted to fool the spatial attentional reset mechanisms which usually keep different object views separated during learning.
Many electrophysiological experiments have shown that cells in the inferotemporal (IT) cortex respond to the same object at different retinal positions; for example, many IT cells show little attenuation in firing rate across object translations (Booth and Rolls, 1998, Desimone and Gross, 1979, Gross et al., 1972, Ito et al., 1995, Schwartz et al., 1983). The target swapping experiment of Li and DiCarlo (2008) showed, in addition, how the positional selectivity of cells in IT can be altered by experience. Their experiment was divided into two exposure phases, in which two extra-foveal positions (3° above or below the center of gaze) were prechosen as swap and non-swap positions. The experiment studied IT neuronal responses to two objects that initially elicited strong (preferred) and moderate (non-preferred) responses at the two positions. The monkey always began a learning trial looking at a fixation point. During a “normal exposure”, when an object appeared at the prechosen non-swap position, the monkey quickly moved its eyes to it with a saccadic eye movement that brought its image to the fovea. During a “swap exposure”, in which an object appeared at the prechosen swap position, that object was always swapped for the other object during the saccade. Li and DiCarlo found that IT neuron selectivity to the two objects at the swap position was reversed with increasing exposure (see Fig. 1(A)), but there was little or no change at the non-swap position.
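The exposure logic of the experiment can be sketched in a few lines of Python. The object labels P (preferred) and N (non-preferred), the position constants, and all function names here are illustrative, not taken from the experiment itself:

```python
# Hypothetical sketch of the Li and DiCarlo (2008) exposure schedule:
# object identity is preserved across the saccade at the non-swap
# position but exchanged at the swap position.

SWAP_POS, NON_SWAP_POS = +3, -3  # degrees above/below the center of gaze

def exposure(start_object, position):
    """Return the object identity on the fovea after the saccade."""
    other = {"P": "N", "N": "P"}[start_object]
    if position == SWAP_POS:
        return other          # swap exposure: identity exchanged mid-saccade
    return start_object       # normal exposure: identity preserved

# During a swap exposure, the peripheral and foveal views belong to
# different objects, so learning based on temporal continuity pairs them.
assert exposure("P", SWAP_POS) == "N"
assert exposure("P", NON_SWAP_POS) == "P"
```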
The pARTSCAN model (Fig. 2) quantitatively explains and simulates the Li and DiCarlo data as a manifestation of the mechanisms whereby the brain learns position-invariant object representations. Some prominent efforts to model IT have built invariant representations using a hierarchy of feedforward filters leading to a learned category choice (Bradski and Grossberg, 1995, Grossberg and Huang, 2009, Riesenhuber and Poggio, 1999, Riesenhuber and Poggio, 2000, Riesenhuber and Poggio, 2002), or through grouping object translations through time (Fazl et al., 2009, Wallis and Rolls, 1997). The pARTSCAN model proposes how the brain learns position-invariant object representations that are consistent with the Li and DiCarlo swapping data. In particular, the pARTSCAN model, as in the ARTSCAN model on which it builds, proposes how multiple brain processing stages, beginning in the retina and lateral geniculate nucleus (LGN), and proceeding through cortical areas V1, V2, V4, and IT in the What cortical stream, can gradually learn such position-invariant object representations, as they interact with Where cortical stream processing stages in the parietal cortex.
The ARTSCAN model proposes how an object’s surface representation in cortical area V4 generates a form-fitting distribution of spatial attention, or “attentional shroud”, in the parietal cortex of the Where cortical stream. All surface representations dynamically compete for spatial attention to form a shroud. The winning shroud (or shrouds; see Foley, Grossberg, and Mingolla (submitted for publication) for simulations of multifocal attention) remains active due to a surface-shroud resonance that persists during active scanning of the object with eye movements. The active shroud regulates eye movements and category learning about the attended object in the following way.
The first view-specific category to be learned for the attended object also activates a cell population at a higher processing stage. This cell population will become a view-invariant object category. Both types of category are assumed to form in the IT cortex of the What cortical stream. As the eyes explore different views of the object, previously active view-specific categories are reset to enable new view-specific categories to be learned. What prevents the emerging view-invariant object category from also being reset? The shroud maintains the activity of the emerging view-invariant category representation by inhibiting a reset mechanism, also predicted to be in the parietal cortex, that would otherwise inhibit the view-invariant category. As a result, all the view-specific categories can be linked through associative learning to the emerging view-invariant object category. Indeed, these associative linkages create the view invariance property.
Shroud collapse disinhibits the reset signal, which in turn inhibits the active view-invariant category. Then a new shroud, corresponding to a different object, forms in the Where cortical stream as new view-specific and view-invariant categories of the new object are learned in the What cortical stream. The model hereby mechanistically clarifies basic properties of spatial attention shifts (engage, move, disengage) and inhibition of return. As noted in Section 4, the concepts of shroud persistence and reset clarify traditional ideas about sustained and transient attention, respectively.
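This shroud-gated reset cycle can be summarized in a minimal sketch. The salience values, function names, and boolean gating below are illustrative simplifications, not the model's published equations:

```python
# Toy sketch (assumed names) of the ARTSCAN shroud/reset logic:
# surface representations compete for spatial attention; while the
# winning shroud is active it inhibits the reset signal, which would
# otherwise inhibit the view-invariant category. Shroud collapse
# disinhibits reset, which shuts the category off.

def attend(saliences):
    """Winner-take-all competition among surface representations."""
    return max(saliences, key=saliences.get)

def invariant_category_active(shroud_active):
    reset = not shroud_active      # active shroud inhibits the reset cell
    return not reset               # reset would inhibit the category

winner = attend({"boat": 0.9, "bowl": 0.4})
assert winner == "boat"
assert invariant_category_active(True)       # shroud up: category persists
assert not invariant_category_active(False)  # shroud collapse: category reset
```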
The ARTSCAN model does not, however, explain how position-invariant object categories are learned and recognized. The current article proposes what additional brain mechanisms are needed to learn position-invariant object categories. These new mechanisms include a new functional role for cells with persistent activity in IT (see Brunel, 2003, Fuster and Jervey, 1981, Miyashita and Chang, 1988, Tomita et al., 1999) and a competitive learning law whereby more predictive unambiguous object views learn to have a larger effect on object recognition than less predictive ambiguous views.
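The flavor of such a competitive learning law can be illustrated with a toy sketch, assuming a simple gated-descent rule and normalization across object categories; the rule, rates, and variable names are illustrative, not the paper's equations:

```python
import numpy as np

# Toy sketch of a competitive view-to-object learning law: a view's
# outgoing weights are driven toward the object categories it is
# coactive with and then normalized, so a view paired with a single
# object (unambiguous) ends up with a stronger weight than a view
# paired with two objects (ambiguous).

def learn(w, coactive_objects, rate=0.5, steps=200):
    w = np.array(w, dtype=float)
    target = np.zeros_like(w)
    target[coactive_objects] = 1.0 / len(coactive_objects)
    for _ in range(steps):
        w += rate * (target - w)   # gated descent toward the coactive pattern
    return w / w.sum()             # normalizing competition across objects

w_unambig = learn([0.5, 0.5], coactive_objects=[0])     # one predicted object
w_ambig = learn([0.5, 0.5], coactive_objects=[0, 1])    # two predicted objects
assert w_unambig[0] > w_ambig[0]  # unambiguous view dominates recognition
```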
The pARTSCAN model quantitatively simulates the swapping data by showing how the swapping procedure fools the spatial attentional shroud mechanism that is usually reset when a new object is presented, thereby preventing multiple objects from learning to activate the same invariant object category. The model predicts that the shroud of the previous object is not reset during the swap with another object. Persistence of this attentional shroud across swaps, interacting with IT persistent activity and competitive learning, leads to rapid reshaping of IT receptive fields through unsupervised natural visual experience. In addition to these predictions, which can be tested in monkeys, a prediction is made in Section 4 about how to test the shroud hypothesis during a swapping experiment using fMRI in humans. The same combination of brain mechanisms can also explain how swapping targets of different sizes can lead to rapid learning of the corresponding mixtures of object views at different sizes (Li & DiCarlo, 2010).
Model processing stages
The model consists of the following processing stages. See Fig. 2. These stages are described heuristically in this section and mathematically in Section 5.
Contrast normalization and discounting the illuminant. The contrasts in each input image are normalized, and background illumination is discounted, in the simplified model retina and LGN by an on-center off-surround network whose cells obey membrane, or shunting, equations (Grossberg and Todorovic, 1988, Werblin, 1971). This network defines
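A toy Python sketch of such shunting normalization, assuming a delta-function on-center and a uniform off-surround (illustrative choices, not the model's fitted kernels), shows two signature properties of shunting networks: bounded activities and approximate contrast (ratio) invariance:

```python
import numpy as np

# Sketch of shunting on-center off-surround contrast normalization,
# in the style of Grossberg and Todorovic (1988); kernels and the
# decay constant A are illustrative.

def shunting_equilibrium(image, A=1.0):
    center = image                                 # narrow on-center
    surround = np.full_like(image, image.mean())   # broad off-surround
    # equilibrium of dx/dt = -A*x + (1 - x)*center - x*surround
    return (center - surround) / (A + center + surround)

img = np.array([1.0, 2.0, 4.0])
out = shunting_equilibrium(img)
out2 = shunting_equilibrium(2 * img)
# activities stay bounded, and doubling the input (brighter illuminant)
# leaves the normalized pattern nearly unchanged
assert np.all(out < 1) and np.all(out > -1)
assert np.all(np.abs(out2 - out) < 0.1)
```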
Model simulations
The swapping simulations used the same two objects, a boat and a bowl, as stimuli that were used in the Li and DiCarlo experiment. In the simulated retina, an above-foveal position was predefined as the swap position and a below-foveal position was predefined as the non-swap position. In order to simulate normal daily experience, the two objects were first learned at three positions (the predefined swap position, non-swap position, and fovea) by 10,000 normal exposures. This experience led
Discussion
Target swapping fools the reset mechanism. The pARTSCAN model proposes how the brain can learn object categories that are invariant across object positions, sizes, and views. The key model mechanisms that enable position-invariant object learning also enable the model to quantitatively simulate the Li and DiCarlo (2008) swapping data. In effect, the Li and DiCarlo (2008) experiments bypass the mechanism whereby attentional shrouds normally get reset when one object is replaced by another one. Their
Model equations
Retina/LGN: discounting the illuminant and contrast normalization. The luminance of the retinal input image at each position is preprocessed by the model retina/LGN to discount the illuminant and contrast-normalize the image using shunting on-center off-surround networks (Grossberg & Todorovic, 1988), which define the equilibrium output signals of ON and OFF cells at each position
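The snippet truncates before the equations themselves; for orientation, the generic form of a shunting on-center off-surround network and its equilibrium, in the style of Grossberg and Todorovic (1988) with illustrative symbols, is:

```latex
% Shunting on-center off-surround dynamics (symbols illustrative):
% x_{ij} is the cell activity at position (i,j), I_{pq} the input image,
% C_{pqij} and E_{pqij} the on-center and off-surround kernels,
% A the passive decay rate, B and -D the upper and lower activity bounds.
\frac{dx_{ij}}{dt} = -A x_{ij}
  + (B - x_{ij}) \sum_{p,q} C_{pqij} I_{pq}
  - (x_{ij} + D) \sum_{p,q} E_{pqij} I_{pq}

% Setting dx_{ij}/dt = 0 gives the equilibrium activity, which is
% divisively normalized by the total input:
x_{ij} = \frac{B \sum_{p,q} C_{pqij} I_{pq} - D \sum_{p,q} E_{pqij} I_{pq}}
              {A + \sum_{p,q} \left( C_{pqij} + E_{pqij} \right) I_{pq}}
```

The denominator grows with total input, which bounds activities between $-D$ and $B$ and makes the output depend on input contrast ratios rather than absolute luminance, thereby discounting the illuminant.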
Acknowledgments
This work was supported in part by CELEST, a National Science Foundation Science of Learning Center (SBE-0354379), and by the SyNAPSE program of DARPA (HR0011-09-C-0001).
References
- Texture segregation by visual cortex: perceptual grouping, attention, and learning. Vision Research (2007).
- Fast-learning VIEWNET architectures for recognizing three-dimensional objects from multiple two-dimensional views. Neural Networks (1995).
- A massively parallel architecture for a self-organizing neural pattern-recognition machine. Computer Vision, Graphics, and Image Processing (1987).
- Normal and amnesic learning, recognition and memory by a neural model of cortico–hippocampal interactions. Trends in Neurosciences (1993).
- Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks (1991).
- Top-down and bottom-up attention to memory: a hypothesis (AtoM) on the role of the posterior parietal cortex in memory retrieval. Neuropsychologia (2008).
- Visual areas in the temporal cortex of the macaque. Brain Research (1979).
- Evidence for boundary-specific grouping. Vision Research (1998).
- View-invariant object category learning, recognition, and search: how spatial and object attention are coordinated using surface-based attentional shrouds. Cognitive Psychology (2009).
- Consciousness CLEARS the mind. Neural Networks (2007).
- Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Research.
- Laminar cortical dynamics of 3D surface perception: stratification, transparency, and neon color spreading. Vision Research.
- Unsupervised natural visual experience rapidly reshapes size-invariant object representation in inferior temporal cortex. Neuron.
- View-dependent object recognition by monkeys. Current Biology.
- The normalization model of attention. Neuron.
- Neural mechanisms of object recognition. Current Opinion in Neurobiology.
- Psychophysical evidence for boundary and surface systems in human vision. Vision Research.
- Computational anatomy and functional architecture of striate cortex: a spatial mapping approach to perceptual coding. Vision Research.
- A quantitative theory of immediate visual recognition. Progress in Brain Research.
- The visual field representation in striate cortex of the macaque monkey: asymmetries, anisotropies, and individual variability. Vision Research.
- Invariant face and object recognition in the visual system. Progress in Neurobiology.
- Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences.
- Visual short-term memory operates more efficiently on boundary features than on surface features. Perception & Psychophysics.
- View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex.
- Dynamics and plasticity of stimulus selective persistent activity in cortical network models. Cerebral Cortex.
- Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences of the United States of America.
- How are three-dimensional objects represented in the brain? Cerebral Cortex.
- The parietal cortex and episodic memory: an attentional account. Nature Reviews Neuroscience.
- A laminar cortical model of stereopsis and 3D surface perception: closure and da Vinci stereopsis. Spatial Vision.
- Pattern recognition by self-organizing neural networks.
- Fuzzy ARTMAP—a neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks.
- Where’s Waldo? How the brain learns to categorize and discover desired objects in a cluttered scene [Abstract]. Journal of Vision.
- A domain-independent source of cognitive control for task sets: shifting spatial attention and switching categorization rules. Journal of Neuroscience.
- Voluntary orienting is dissociated from target detection in human posterior parietal cortex. Nature Neuroscience.
- The representation of the visual field on the cerebral cortex in monkeys. The Journal of Physiology.
- Cognition through color.
- Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society, Series B (Biological Sciences).
- Neurocomputational models of working memory. Nature Neuroscience.
- Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience.
- From stereogram to surface: how the brain sees the world in depth. Spatial Vision.
- Overlap of receptive field centers and representation of the visual field in the cat’s optic tract. Vision Research.