A multiscale dilated dense convolutional network for saliency prediction with instance-level attention competition☆
Introduction
When observing complex scenes, the human visual system can rapidly and selectively locate eye fixations on informative regions. This capability is called visual selective attention [1], [2], [3]. Such a neurological mechanism is an evolutionary result that humans employ to make important decisions based on quick and effective perception of the surrounding environment. Visual saliency estimation, a computational simulation of visual selective attention, is a classic research area in computer vision and neuroscience. A visual attention model can be widely applied, for example, in object detection, computational resource allocation, and video stream compression rate control.
Inspired by biological studies, traditional models [4], [5], [6], [7], [8], [9] mainly depend on hand-crafted features, such as color, texture, and intensity, from which center-surround contrast is calculated to estimate saliency. Modern learning-based methods [10], [11], [12], [13] combine low-level cues with high-level information, such as faces, cars, and texts, extracted by object detectors. Along with the development of deep learning [14], a large number of saliency models [15], [16], [17], [18], [19], [20] based on convolutional neural networks (CNNs) [21], [22], [23] have been proposed, achieving better prediction performance on several large-scale datasets [24], [25], [12].
Though modern CNN-based saliency prediction models perform much better than conventional ones, there is still a gap between these models and human behavior. One possible reason for the gap is that existing models do not handle attention allocation well in scenes with multiple objects, especially when the objects belong to the same semantic category. Fig. 1 shows two examples of this issue. Fig. 1a is a scene with objects of different semantic categories. According to the ground truth, most attention should focus on the clock; however, the CNN-based model [20] allocates too much saliency to other objects. Fig. 1b shows a traffic intersection with two persons, where the right person attracts more attention than the left one in the ground truth, yet the CNN-based model [20] produces the opposite result.
CNN-based models [20], [19] have already considered the allocation of attention among multiple objects. Both extract saliency-related visual features with ResNet [22]. Using a convolutional long short-term memory (ConvLSTM) network [26], Cornia et al. [20] progressively refine the saliency of different regions to assign reasonable attention to multiple objects. Liu et al. [19] utilize a long-term recurrent convolutional network to model visual attention competition in the global context. However, neither model has achieved complete success, because both allocate attention mainly based on semantic information, owing to the structural limitation of ResNet.
To achieve human-level performance, a saliency model should take instance-level attention competition into account. The key is to find good representations for instances. In our opinion, instance features should encode three types of information for saliency prediction: inter-class, intra-class, and contextual information.
Firstly, both discriminative categorical features and instance-specific intra-class differences are crucial for instance-level attention competition. A CNN with dense connections can encode both kinds of features; therefore, an improved DenseNet [23] is introduced to extract them.
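The dense connectivity pattern that makes this possible can be sketched as follows. This is a toy NumPy illustration (a random channel-mixing layer stands in for the BN-ReLU-Conv unit of a real dense layer; all sizes are illustrative, not the paper's configuration): every layer receives the channel-wise concatenation of the block input and all preceding layers' outputs, so both early (class-generic) and late (instance-specific) features stay accessible.

```python
import numpy as np

def conv_layer(x, out_channels, rng):
    """Toy 1x1 'convolution': a random channel-mixing matrix.
    Stands in for the BN-ReLU-Conv unit of a real dense layer."""
    c, h, w = x.shape
    weight = rng.standard_normal((out_channels, c))
    return np.einsum('oc,chw->ohw', weight, x)

def dense_block(x, num_layers, growth_rate, rng):
    """DenseNet-style connectivity: each layer sees the concatenation
    of the block input and every preceding layer's output."""
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)   # channel-wise concat
        out = conv_layer(inp, growth_rate, rng)  # adds `growth_rate` new maps
        features.append(out)
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))              # 16 input channels
y = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
print(y.shape)  # (16 + 4*12, 8, 8) = (64, 8, 8)
```

The block output grows linearly with depth (input channels plus layers times growth rate), which is what lets later layers reuse, rather than overwrite, earlier features.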
Secondly, context is another important cue, because it can serve as an alternative to center-surround contrast. Contextual information also helps discriminate between instances of the same class. Dilated convolution is used to extract the contextual information of each object.
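Why dilation captures more context can be seen from simple receptive-field arithmetic. The sketch below (layer settings are illustrative, not the paper's) shows that three stacked 3x3 layers with exponentially increasing dilation cover a 15-pixel span per axis, versus 7 pixels without dilation, at the same parameter cost:

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of convolutions.
    Each layer is given as (kernel_size, dilation, stride)."""
    rf, jump = 1, 1
    for k, d, s in layers:
        rf += (k - 1) * d * jump   # dilation widens the effective kernel
        jump *= s
    return rf

# Three 3x3 layers, stride 1, dilations 1, 2, 4:
print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1)]))  # 15
# The same three layers without dilation see far less context:
print(receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)]))  # 7
```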
Lastly, multiscale information is collected to handle objects of different sizes within one natural scene. Since the transition layers in DenseNet block the flow of multiscale features, we introduce shortcut connections to compensate.
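One way such a shortcut can work, sketched in toy NumPy form (the pooling operator and channel counts are illustrative assumptions, not the paper's exact design): the fine-scale features produced before a transition layer are downsampled to the coarser resolution and concatenated with the next block's output, so multiscale information survives the transition.

```python
import numpy as np

def transition(x):
    """Transition layer stand-in: 2x2 average pooling halves resolution."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def shortcut_merge(fine, coarse):
    """Shortcut connection: pool the earlier fine-scale features to the
    coarse resolution and concatenate them channel-wise, so the later
    block still sees multiscale information."""
    pooled = transition(fine)
    return np.concatenate([pooled, coarse], axis=0)

rng = np.random.default_rng(0)
block1_out = rng.standard_normal((32, 16, 16))  # fine-scale features
block2_out = rng.standard_normal((48, 8, 8))    # next block's coarser output
merged = shortcut_merge(block1_out, block2_out)
print(merged.shape)  # (32 + 48, 8, 8) = (80, 8, 8)
```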
In this paper, we propose a multiscale dilated dense convolutional network (DDCN) to extract features covering these three aspects for saliency prediction. The multiscale DDCN can be trained in an end-to-end manner. The proposed model achieves better prediction performance than state-of-the-art approaches on three benchmark datasets: CAT2000 [25], SALICON [24], and MIT1003 [12].
The rest of this paper is organized as follows. Section 2 presents a summary of the literature on saliency prediction. Section 3 describes the proposed multiscale DDCN with instance-level attention competition. Section 4 provides implementation details, ablation analysis, and qualitative and quantitative evaluation of the proposed model in comparison with state-of-the-art saliency models. In Section 5, we summarize our work.
Related works
In this section, we summarize works related to saliency prediction. The related literature is divided into conventional and modern models according to whether they employ deep learning. Conventional models are sorted by their use of local, non-local, global, and semantic information. Modern models are categorized as end-to-end models and others. Since the number of papers in this area is extremely large, please refer to [27], [28] for comprehensive surveys of the field.
Model architecture
In this section, we propose a multiscale DDCN for saliency prediction, whose architecture is shown in Fig. 2. In this architecture, four dense blocks, built from an improved DenseNet [23], extract inter- and intra-class information. Dilated convolutional kernels [55] are introduced into the basic DenseNet to enlarge the receptive field of the feature maps and thereby obtain contextual information about objects. In Fig. 2, the two dilated dense blocks are equipped with dilated convolutional kernels.
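The channel bookkeeping through such a four-block backbone can be sketched as follows. The block configuration here is a hypothetical DenseNet-121-style layout (6/12/24/16 layers, growth rate 32, 0.5 compression) chosen only for illustration; the paper's exact configuration is not reproduced here.

```python
# Hypothetical configuration: four dense blocks, the last two dilated.
blocks = [
    {"layers": 6,  "growth": 32, "dilation": 1},
    {"layers": 12, "growth": 32, "dilation": 1},
    {"layers": 24, "growth": 32, "dilation": 2},  # dilated dense block
    {"layers": 16, "growth": 32, "dilation": 2},  # dilated dense block
]

def output_channels(in_channels, blocks, compression=0.5):
    """Channel count through dense blocks and transition layers:
    each dense layer adds `growth` channels; each transition layer
    (placed between blocks) compresses the channel count."""
    c = in_channels
    for i, b in enumerate(blocks):
        c += b["layers"] * b["growth"]
        if i < len(blocks) - 1:          # transition after all but the last
            c = int(c * compression)
    return c

print(output_channels(64, blocks))  # 1024
```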
Experiments
In this section, benchmark datasets, evaluation metrics, and implementation details are introduced first. Ablation experiments are then performed to analyze the contribution of each component of the model. Finally, qualitative and quantitative comparisons between the proposed model and state-of-the-art saliency models are provided.
Conclusions
In this paper, we have proposed a multiscale dilated dense convolutional network for saliency prediction, which encodes instance features of objects to handle scenes with multiple objects well, especially when the objects belong to the same semantic category. Dense connections enable the architecture to learn both inter- and intra-class information, while contextual information is obtained by introducing dilated convolution. Inter- and intra-class as well as contextual information are thus combined into instance-level representations.
Declaration of Competing Interest
There is no conflict of interest in this work.
Acknowledgments
The work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61572387, 61632019, 61836008, 61672404, and in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant No. 61621005.
References (66)
- et al., A feature-integration theory of attention, Cognit. Psychol. (1980)
- et al., Non-local spatial redundancy reduction for bottom-up saliency estimation, J. Vis. Commun. Image Represent. (2012)
- et al., Nonlocal center-surround reconstruction-based bottom-up saliency estimation, Pattern Recognit. (2015)
- et al., Learning to predict eye fixations for semantic contents using multi-layer sparse network, Neurocomputing (2014)
- et al., Components of bottom-up gaze allocation in natural images, Vision Res. (2005)
- et al., Selective attention gates visual processing in the extrastriate cortex, Science (1985)
- et al., A model of inhibitory mechanisms in selective attention
- et al., Working memory and the guidance of visual attention: consonance-driven orienting, Psychonom. Bull. Rev. (2007)
- et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
- et al., Contrast-based image attention analysis by using fuzzy growing
- Bottom-up saliency is a discriminant process
- On the plausibility of the discriminant center-surround hypothesis for visual saliency, J. Vis.
- Center-surround divergence of feature statistics for salient object detection
- Visual saliency detection by spatially weighted dissimilarity
- Faces and text attract gaze independent of the task: experimental data and computer model, J. Vis.
- Learning to predict where humans look
- Boosting bottom-up and top-down visual features for saliency estimation
- Deep learning, Nature
- Large-scale optimization of hierarchical features for saliency prediction in natural images
- SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks
- Shallow and deep convolutional networks for saliency prediction
- DeepFix: a fully convolutional neural network for predicting human eye fixations, IEEE Trans. Image Process.
- Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Process.
- Deep residual learning for image recognition
- Densely connected convolutional networks
- SALICON: saliency in context
- State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell.
☆ This paper has been recommended for acceptance by Zicheng Liu.