Neurocomputing

Volume 180, 5 March 2016, Pages 35-54
An object-based visual selection framework

https://doi.org/10.1016/j.neucom.2015.10.111

Abstract

Real scenes are composed of multiple points with distinct characteristics. Only part of the scene undergoes scrutiny at a time, and the mechanism responsible for this task is named selective visual attention. The spatial location with the highest contrast may stand out from the scene and reach the level of awareness (bottom-up attention). On the other hand, attention may also be voluntarily directed to a particular object in the scene (object-based attention), which requires the recognition of a specific target (top-down modulation). In this paper, a new visual selection model is proposed, which combines both early visual features and object-based visual selection modulations. The possibility of modulation regarding specific features enables the model to be applied to different domains. The proposed model integrates three main mechanisms. The first handles the segmentation of the scene, allowing the identification of objects. The second computes the average saliency of each object, which provides the modulation of visual attention for one or more features. Finally, the third builds the object-saliency map, which highlights the salient objects in the scene. We show that top-down modulation has a stronger effect than bottom-up saliency when a memorized object is selected, and this effect is clearer in the absence of any bottom-up cue. Experiments with synthetic and real images are conducted, and the obtained results demonstrate the effectiveness of the proposed approach for visual selection.

Introduction

Every day we face complex scenes, and our visual system needs to analyze and understand a large amount of visual information while ignoring unimportant things. To handle this information, our visual system must deliver proper attention to specific objects; this task is called visual attention or visual selection. Biologically speaking, an object whose features contrast with the background pops out and draws attention automatically [1], [2]. The information that defines the contrast among objects is related to both primitive features of the scene and previous knowledge about specific targets (memory). This information is related to two distinct components that drive human visual attention. The first is bottom-up attention, which is involuntarily guided by visual features (such as color, depth, and motion). The second is top-down modulation, which voluntarily guides attention towards specific features or known objects in the scene [3]. It is worth noting that when objects share similar primitive features, top-down modulation becomes the dominant attribute for selecting one of them as a target [4].

But what is an object? An object can be defined as anything that is visible or tangible and is relatively stable in form. For simplicity, here each object refers to a segment segregated from the background of the scene. Moreover, according to the Temporal Correlation Theory, we might say that an object is a binding of distinct features gathered from the visual scene into a single percept. Behavioral and neurophysiological evidence has shown that the selection of objects is used in the primate visual system [5], [6]. It is believed that a pre-attentive process, or perceptual organization, is run by the brain unconsciously, performing a figure-ground segregation and a segmentation of the visual scene into a set of objects. Those objects, in fact, compete for attention [7]. Perceptual organization has been studied in Gestalt psychology, which indicates that the world is perceived as a cluster of well-structured objects and not as a collection of unorganized points. The formation of objects is governed by the Gestaltian laws of grouping, such as connectivity, proximity, and similarity. It is worth noting that although these processes can be characterized as bottom-up, they can also be influenced by top-down mechanisms [4].

The Temporal Correlation Theory [8], [9] offers an interesting approach to representing multiple objects in a scene by using artificial neural network models, such as the LEGION (Locally Excitatory Globally Inhibitory Oscillator Network) [10], [11]. Thus, by taking the temporal correlation into account, we can integrate distinct features and deal with several objects in a scene.
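The building block of the LEGION network is a relaxation oscillator of the kind introduced by Terman and Wang, with a fast excitatory variable and a slow inhibitory one; stimulated oscillators enter a limit cycle while unstimulated ones stay at rest. The following is a minimal single-oscillator sketch; the parameter values are illustrative, not those used in the original LEGION papers:

```python
import math

def simulate_oscillator(I, steps=20000, dt=0.01,
                        eps=0.02, gamma=6.0, beta=0.1):
    """Euler integration of a Terman-Wang-style relaxation oscillator.

    x: fast excitatory variable; y: slow inhibitory variable;
    I: external stimulation (I > 0 -> oscillation, I < 0 -> rest).
    Returns the trajectory of x.
    """
    x, y = -2.0, 0.0
    xs = []
    for _ in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x += dt * dx
        y += dt * dy
        xs.append(x)
    return xs

stimulated = simulate_oscillator(I=0.2)   # jumps between the two branches
silent = simulate_oscillator(I=-0.5)      # settles on the left (rest) branch
```

In a full LEGION network, local excitatory coupling synchronizes oscillators belonging to the same object, while a global inhibitor desynchronizes different objects, so each object "pops up" in its own time window.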

Bottom-up models [12], [13], [14], [15] do not consider the role of a working memory in visual selection. In those models, only the primitive features of the image are used to identify the salient point or region of the scene. The selection is directly related to unsupervised learning, whose goal is to find groups of similar objects according to their features without supervision, or involuntarily. On the other hand, the associative memory considered in top-down models might be associated with some form of supervision [16]. In this case, the visual system searches for previously known objects, which is an inherent characteristic of supervised learning methods. This processing involves the concept of a working memory, which temporarily holds some information about a target used to modulate the selection process. It is worth noting that top-down attention might also influence the response to bottom-up cues. According to [17], [18], bottom-up attention alerts us to salient details in the scene, whereas top-down attention modulates bottom-up signals to bias the features of a specific target.
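The interplay between the two components is often modeled as a weighted blend of a bottom-up saliency map and a top-down bias map. The sketch below illustrates the idea; the linear combination and the weight `lam` are simplifying assumptions for illustration, not the formulation used in this paper:

```python
def combine_saliency(bottom_up, top_down, lam=0.5):
    """Blend a bottom-up saliency map with a top-down bias map.

    Both maps are 2-D lists of values in [0, 1]; lam controls how
    strongly top-down modulation overrides bottom-up contrast.
    """
    return [
        [(1.0 - lam) * bu + lam * td
         for bu, td in zip(row_bu, row_td)]
        for row_bu, row_td in zip(bottom_up, top_down)
    ]

# A bright distractor (bottom-up) vs. a memorized target (top-down):
bu = [[0.9, 0.1],
      [0.1, 0.1]]
td = [[0.0, 0.0],
      [0.0, 1.0]]
combined = combine_saliency(bu, td, lam=0.7)
# With a strong top-down weight, the memorized location (~0.73)
# now outweighs the bright distractor (~0.27).
```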

In visual selection research that considers both bottom-up and top-down modulations, one can observe two different approaches: (1) the development of psychophysical/psychological or computational models to reproduce real vision mechanisms and/or effects; and (2) the development of pattern recognition tools that simulate biological visual selection mechanisms, i.e., that use memory for pattern storage and recall. Many works have been developed in the latter category [19], [20], [21], [22], [23], [24], [25], [26].

In the first approach, previous works have focused on describing the biological/psychological mechanisms of working memory in the process of visual selection [27], [28], [29], [30]. However, they still lack computational models that properly address this issue. For this reason, we propose a computational model to study how working memory can influence or benefit visual selection. In this sense, our paper contributes to the first approach of visual selection research.

Specifically, a new object-based visual selection model with both bottom-up and top-down modulations is proposed. Our model is composed of the following modules: (1) a Visual Feature Extraction module, responsible for extracting the early visual features, such as colors and orientation; (2) a LEGION network [10], for image segmentation; (3) Network-Based High Level Data Classification, named HLC [31], for object recognition; (4) a Network of Integrate and Fire Objects, which creates our object-saliency map; and finally (5) an Object Selection module, which selects all the salient objects in the scene based on the guidance from the object-saliency map.

By using the LEGION network, we provide an elegant way to temporally code the objects in the scene [11]. It means that the objects formed during the segmentation process are highlighted one at a time, which allows a serial scanning of the visual scene. Moreover, the LEGION network is a well-known model consistent with the temporal correlation theory, and it has been extensively analyzed and applied to several tasks [10], [11]. Also, our model considers prior external information about an object by combining low-level and high-level data classifications [31]. The high-level classification exploits the complex topological properties of the underlying network constructed from the input data. We show that the combined classification approach yields robust pattern recognition.

By integrating the modules mentioned above and using our new object-saliency map, our model can deliver attention to objects of the scene regarding their visual features or previous knowledge of the objects/domain. Additionally, we provide qualitative and quantitative comparisons of the proposed model against ground truth fixation maps [32] and nine state-of-the-art methods for saliency detection [12], [33], [34], [35], [36], [37], [38], [39], [40].

In summary, the main contributions of this work are:

  1. The salience value of an object, one of the major components of this work, is calculated using bottom-up features, top-down features, or both.

  2. The absence of specific salient features may cause regions or objects in the scene to be automatically ignored.

  3. Objects with a saliency value below the threshold do not participate in the competition for attention.

  4. The object-saliency map is defined as a network of objects with two types of connections: excitatory connections, which synchronize groups of objects representing similar patterns, and inhibitory connections, which suppress objects belonging to the background, allowing the most salient object in the scene to be selected.
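Contributions 1-3 can be sketched as follows: given a label map produced by segmentation and a pixel-level saliency map, the salience of each object is the mean saliency over its pixels, and objects below a threshold are excluded from the competition. The function name and threshold value below are illustrative assumptions, not the paper's exact formulation:

```python
from collections import defaultdict

def object_saliency(labels, saliency, threshold=0.2):
    """Average pixel saliency per segmented object and filter weak objects.

    labels:   2-D list of object ids from segmentation (0 = background)
    saliency: 2-D list of pixel saliency values in [0, 1]
    Returns {object_id: mean_saliency} for objects above the threshold.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row_l, row_s in zip(labels, saliency):
        for obj, s in zip(row_l, row_s):
            if obj == 0:          # skip background pixels
                continue
            sums[obj] += s
            counts[obj] += 1
    means = {obj: sums[obj] / counts[obj] for obj in sums}
    # Objects below the threshold do not compete for attention:
    return {obj: m for obj, m in means.items() if m >= threshold}

labels = [[1, 1, 0],
          [2, 2, 0]]
sal = [[0.8, 0.6, 0.0],
       [0.1, 0.1, 0.0]]
result = object_saliency(labels, sal)
# keeps object 1 only; object 2's mean (0.1) falls below the threshold
```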

This paper is organized as follows. In Section 2, a brief review of the early visual features extraction, the segmentation model, and the network-based high-level data classification is provided. Section 3 introduces the proposed model. Computer simulations are presented in Section 4. Finally, concluding remarks and future directions are drawn in Section 5.

Section snippets

Background

In this section, we review the feature combination strategies, the segmentation mechanism, and the network-based high-level data classification used in the proposed visual selection model.

Proposed model description

The proposed approach to select salient objects is composed of the following modules: a Visual Feature Extraction module; a LEGION network for image segmentation; a network-based high-level data classifier for object recognition; a network of integrate-and-fire neurons, which creates the object-saliency map; and, finally, an object selection module, which highlights the most salient objects in the scene.

Fig. 2 depicts a flowchart of the proposed model. Firstly, the scene is presented to the module

Computer simulations

According to [32], many visual attention models have been proposed to predict the locations of a scene to which a human would direct attention. However, each new model is evaluated using new images, which makes it difficult to compare the results. Thus, to minimize this limitation while performing qualitative and quantitative analyses, here we use fixation maps (FM) generated by tracking the eye movements of human observers for a variety of images. Several

Conclusions

In this work, an object-based visual selection model combining a top-down contextual classifier with bottom-up saliency was proposed for locating salient objects in real and synthetic images. The proposed model was able to select objects regarding their visual features as well as the previous knowledge of the system. Thanks to this model, top-down modulation can overcome bottom-up saliency by selecting a known object instead of the (bottom-up) most salient one, which is even clearer in the absence of any bottom-up

References (62)

  • P.R. Roelfsema et al.

    Object-based attention in the primary visual cortex of the macaque monkey

    Nature

    (1998)
  • R. Desimone et al.

    Neural mechanisms of selective visual attention

    Annu. Rev. Neurosci.

    (1995)
  • C. von der Malsburg, The Correlation Theory of Brain Function, Internal report 81-2: Max-Planck Institute for...
  • C. von der Malsburg et al.

    A neural cocktail-party processor

    Biol. Cybern.

    (1986)
  • D. Wang

    The time dimension for scene analysis

    IEEE Trans. Neural Netw.

    (2005)
  • L. Itti et al.

    A model of saliency-based visual attention for rapid scene analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1998)
  • L. Itti, Models of bottom-up attention and saliency, in: Neurobiology of Attention, Elsevier, Oxford, 2005, pp. 576–582...
  • A.X. Benicasa, R.A.F. Romero, Localization of salient objects in scenes through visual attention, in: IEEE Proceedings...
  • A.X. Benicasa, M.G. Quiles, L. Zhao, R.A. Romero, An object-based visual selection model with bottom-up and top-down...
  • G. Deco, E.T. Rolls, The role of short-term memory in visual attention, in: Neurobiology of Attention, Elsevier,...
  • C.E. Connor et al.

Visual attention: bottom-up versus top-down

    Curr. Biol.

    (2004)
  • J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual attention, CoRR, vol. abs/1412.7755, 2014....
  • C. Siagian et al.

    Rapid biologically-inspired scene classification using features shared with visual attention

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2007)
  • Z. Ren et al.

    Regularized feature reconstruction for spatio-temporal saliency detection

    IEEE Trans. Image Process.

    (2013)
  • T. Kirishima et al.

    Real-time gesture recognition by learning and selective control of visual interest points

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • Y. Sugano et al.

    Appearance-based gaze estimation using visual saliency

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • D. Gao et al.

    Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • C. Jung et al.

    A unified spectral-domain approach for saliency detection and its application to automatic object segmentation

    IEEE Trans. Image Process.

    (2012)
  • C.G. Healey et al.

    Attention and visual memory in visualization and computer graphics

    IEEE Trans. Vis. Comput. Graph.

    (2012)
  • A. Matsushima et al.

    Different neuronal computations of spatial working memory for multiple locations within versus across visual hemifields

    J. Neurosci.

    (2014)
  • W.X. Schneider

Selective visual processing across competition episodes: a theory of task-driven visual attention and working memory

    Philos. Trans. R. Soc. Lond. B: Biol. Sci.

    (2013)
  • Alcides X. Benicasa received the bachelor׳s degree (with first place honors) from the Faculty of Technology, Taquaritinga, São Paulo, Brazil, in 2001 and the master׳s degree from the Federal University of São Carlos, São Paulo, Brazil, in 2003, both in computer science. He received the Ph.D. degree in Mathematics and Computer Sciences in 2013 by the Institute of Mathematics and Computer Sciences (ICMC), University of São Paulo (USP). He is a Professor at the Department of Information Systems of the Federal University of Sergipe, Itabaiana, Sergipe, Brazil. His current research interests include artificial neural networks, computer vision, visual attention, bioinformatics, and pattern recognition.

    Marcos G. Quiles received the B.S. degree from the State University of Londrina, Brazil, and the M.S.degree from the University of São Paulo, Brazil, in 2003 and 2004, respectively, both in Computer Science. He received the Ph.D. degree in Mathematics and Computer Sciences in 2009 by the Institute of Mathematics and Computer Sciences (ICMC), University of São Paulo (USP). From January 2008 to July 2008, he was a Visiting Scholar in the Department of Computer Science and Engineering, the Ohio State University, USA. He is a Professor at the Department of Science and Technology at the Federal University of São Paulo, São Paulo, Brazil. His current research interests include neural networks, computer vision, complex networks, and machine learning.

Thiago C. Silva obtained the title of Doctor of Science in Computer and Mathematical Sciences from the Institute of Mathematical and Computer Sciences (ICMC), University of São Paulo (USP), Brazil, in December 2012. In 2014, he completed a 1-year postdoctoral research program in Machine Learning and Complex Networks, under the supervision of Prof. Dr. Zhao Liang, at the same university. In 2009, he earned the degree of Computer Engineering, also from the University of São Paulo, with Honors. He attained several academic recognitions during his Doctorate period, among them: (a) Winner of the Capes Thesis Contest 2013, Area: Computer Science, granted by the Brazilian Federal Agency for the Support and Evaluation of Graduate Education (Capes); (b) Winner of the University of São Paulo Thesis Competition 2013, awarded by the University of São Paulo; and (c) Winner of the International BRICS-CCI PhD Theses Competition at the 1st BRICS Countries Congress (BRICS-CCI) and 11th Brazilian Congress on Computational Intelligence. He currently holds the position of Researcher at the Research Department (DEPEP), Central Bank of Brazil (BCB), Brasília, Brazil. He works on financial stability issues, such as systemic risk, using network-based approaches and machine learning methods.

    Liang Zhao received the B.S. degree from Wuhan University, Wuhan, China, and the M.Sc. and Ph.D. degrees from the Aeronautic Institute of Technology, São José dos Campos - SP, Brazil, in 1988, 1996, and 1998, respectively, all in computer science. He joined the University of São Paulo, where he is a Professor with the Department of Computer Science. From 2003 to 2004, he was a Visiting Researcher with the Department of Mathematics, Arizona State University, Tempe, USA. His current research interests include artificial neural networks, machine learning, nonlinear dynamical systems, complex networks, bioinformatics, and pattern recognition. He has published more than 120 scientific articles in refereed international journals, books, and conferences. Dr. Zhao is a recipient of the Brazilian Research Productivity Fellowship. He is currently an Associate Editor of the Neural Networks and he was an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems from 2009 to 2012.

Roseli A.F. Romero received her Ph.D. degree in electrical engineering from the University of Campinas, Brazil, in 1993. From 1996 to 1998, she was a Visiting Scientist at Carnegie Mellon's Robot Learning Lab, USA. Currently, she is a Professor in the Department of Computer Science at ICMC, University of São Paulo (USP). She is the Coordinator of the Learning Robots Laboratory (LAR) at ICMC/USP and a member of the Bioinspired Group of ICMC-USP. She is the president of the Research Committee of ICMC/USP and coordinator of the Pre-IC Program of ICMC/USP. She is Vice Coordinator of the Center for Robotics (CRob-SC) and a member of the Special Robotics Committee of the Brazilian Computer Society. She is a senior member of the International Neural Networks Society (INNS) and a member of the Brazilian Computer Society (SBC). Her research interests include artificial neural networks, machine learning techniques, fuzzy logic, robot learning, and computational vision. She is an ad hoc consultant for FAPESP, CNPq, and CAPES. She is one of the coordinators of the Warthog Robotics Group.
