A multiscale dilated dense convolutional network for saliency prediction with instance-level attention competition

https://doi.org/10.1016/j.jvcir.2019.102611

Highlights

  • A multiscale dilated dense convolutional network is proposed for saliency prediction.

  • Dense connections are used to extract both inter- and intra-class differences.

  • Dilated convolution is introduced to obtain contexts to represent instances better.

  • Multiscale features deal well with objects of different sizes in saliency prediction.

  • The proposed model achieves state-of-the-art performance on three benchmark datasets.

Abstract

Data-driven saliency estimation has attracted increasing interest in recent years owing to the establishment of large-scale annotated datasets and the evolution of deep convolutional neural networks (CNN). Although CNN-based models perform much better than traditional ones in saliency prediction, there is still a gap between computational models and human behavior. One reason is that existing approaches fail to assign correct saliency to different objects in scenes with multiple objects. In this paper, we propose a multiscale dilated dense convolutional network to handle instance-level attention competition for better saliency prediction. In the proposed architecture, dense connections encode inter- and intra-class features for instance-level attention competition, dilated convolution collects contextual information to enrich the feature representations of instances, and shortcut connections provide multiscale features for attention competition across scales. In evaluations on three challenging datasets, CAT2000, SALICON, and MIT1003, the proposed model achieves state-of-the-art performance.

Introduction

When observing complex scenes, the human visual system can rapidly and selectively locate eye fixations on informative regions. This capability is called visual selective attention [1], [2], [3]. Such a neurological mechanism is an evolutionary result that allows humans to make important decisions based on quick and effective perception of the surrounding environment. Visual saliency estimation, a computational simulation of visual selective attention, is a classic research area in computer vision and neuroscience. A visual attention model can be widely applied in, for example, object detection, computational resource allocation, and video compression rate control.

Inspired by biological studies, traditional models [4], [5], [6], [7], [8], [9] mainly depend on hand-crafted features such as color, texture, and intensity, from which center-surround contrast is calculated to estimate saliency. Modern learning-based methods [10], [11], [12], [13] combine low-level cues with high-level information, including faces, cars, and text, extracted by object detectors. Along with the development of deep learning [14], a large number of saliency models [15], [16], [17], [18], [19], [20] based on convolutional neural networks (CNN) [21], [22], [23] have been proposed, achieving better prediction performance on several large-scale datasets [24], [25], [12].

Though modern CNN-based saliency prediction models perform much better than conventional ones, there is still a gap between the models and human behavior. One possible reason for the gap is that existing models do not deal well with attention allocation in scenes with multiple objects, especially when the objects belong to the same semantic category. Fig. 1 shows two examples of this issue. Fig. 1a is a scene with objects belonging to different semantic categories. According to the ground truth, most attention should focus on the clock; however, the CNN-based model [20] allocates too much saliency to the other objects. Fig. 1b shows a traffic intersection with two persons, where the right person attracts more attention than the left one in the ground truth; the CNN-based model [20] produces the opposite result.

CNN-based models [20], [19] have already considered the allocation of attention among multiple objects. Both models extract saliency-related visual features with ResNet [22]. With a convolutional long short-term memory (ConvLSTM) network [26], Cornia et al. [20] progressively refine the saliency of different regions to assign reasonable attention to multiple objects. Liu et al. [19] utilize a long-term recurrent convolutional network to model visual attention competition in the global context. However, neither model is completely successful, because both allocate attention mainly based on semantic information, owing to the structural limitation of ResNet.

To achieve human-level performance, a saliency model should take instance-level attention competition into account. The key is finding good representations for instances. In our opinion, instance features should capture three types of information for saliency prediction: inter-class, intra-class, and contextual information.

Firstly, both discriminative categorical features and instance-specific intra-class differences are crucial for instance-level attention competition. A CNN with dense connections can encode both, so an improved DenseNet [23] is introduced to extract them, as sketched below.
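
To make the role of dense connections concrete, the following is a minimal PyTorch sketch of a DenseNet-style block, in which every layer receives the concatenation of all preceding feature maps. The growth rate, layer count, and channel widths are illustrative placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal DenseNet-style block: each layer sees the concatenation of
    all preceding feature maps, so early (instance-specific) and late
    (class-discriminative) features both remain accessible downstream."""

    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate  # feature maps accumulate by concatenation

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```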

Secondly, context is another important cue because it can be treated as an alternative to center-surround contrast. Contextual information also helps discriminate instances within one class. Dilated convolution is used to extract the contextual information of each object.
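
As a brief illustration (the dilation rates in the actual network may differ), a 3×3 kernel with dilation rate d covers an effective extent of 3 + 2(d − 1) pixels per axis at the parameter cost of an ordinary 3×3 kernel, so stacking a few dilated layers gathers context from a wide neighborhood:

```python
import torch.nn as nn

# A 3x3 kernel with dilation d has an effective extent of 3 + 2*(d - 1)
# pixels per axis: d=2 covers 5x5, d=4 covers 9x9, with no extra weights.
# Setting padding equal to the dilation keeps the spatial size unchanged.
context_branch = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),
    nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4),
)
```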

Lastly, multiscale information is collected to handle the case where objects are of different sizes in one natural scene. Since the transition layers in DenseNet block the flow of multiscale features, we introduce shortcut connections for compensation.
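
One plausible form of such shortcut connections, sketched below under the assumption of simple bilinear resampling, is to tap the feature maps of several blocks, resize them to a common resolution, and concatenate them so that attention competition can operate across scales (the helper name fuse_multiscale is ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(features, size):
    """Resize feature maps taken from different blocks to a common
    resolution and concatenate them along the channel axis, so that
    coarse and fine scales compete for attention side by side."""
    resized = [F.interpolate(f, size=size, mode='bilinear',
                             align_corners=False) for f in features]
    return torch.cat(resized, dim=1)
```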

In this paper, we propose a multiscale dilated dense convolutional network (DDCN) to extract features regarding the three aspects for saliency prediction. In addition, the multiscale DDCN can be trained in an end-to-end manner. The proposed model achieves better prediction performance compared with state-of-the-art approaches on three benchmark datasets, CAT2000 [25], SALICON [24], and MIT1003 [12].

The rest of this paper is organized as follows. Section 2 presents a summary of literature on saliency prediction. Section 3 describes the proposed multiscale DDCN with instance-level attention competition. Section 4 provides implementation details, ablation analysis, qualitative and quantitative evaluation of the proposed model in comparison to the state-of-the-art saliency models. In Section 5, we summarize our work.

Section snippets

Related works

In this section, we summarize works related to saliency prediction. The related literature is divided into conventional and modern models based on whether they employ deep learning. Conventional models are sorted according to their use of local, non-local, global, and semantic information; modern models are categorized as end-to-end models and others. Since the literature in this area is extremely large, we refer the reader to [27], [28] for comprehensive surveys of the field.

Model architecture

In this section, we propose a multiscale DDCN for saliency prediction, whose architecture is shown in Fig. 2. In the architecture, four dense blocks, built from an improved DenseNet [23], are employed to extract inter- and intra-class information. Dilated convolutional kernels [55] are introduced into the basic DenseNet to enlarge the receptive field of the feature maps and thus capture the contextual information of objects. In Fig. 2, two of the dense blocks are equipped with dilated convolutional kernels and are referred to as dilated dense blocks.
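
As a rough end-to-end skeleton of the idea (channel widths, stage depths, and dilation rates are placeholders, and plain convolutional units stand in for the dense blocks of Fig. 2), the later stages trade further downsampling for dilation, and shortcut connections feed every stage into a 1×1 readout that produces the saliency map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_unit(cin, cout, dilation=1):
    # BN-ReLU-Conv unit; the dilated variant enlarges the receptive field.
    return nn.Sequential(
        nn.BatchNorm2d(cin),
        nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, kernel_size=3, padding=dilation, dilation=dilation),
    )

class DDCNSkeleton(nn.Module):
    """Illustrative skeleton only: four stages (the last two dilated),
    shortcut connections tapping every stage, and a 1x1 readout."""

    def __init__(self, width=64):
        super().__init__()
        self.stem = nn.Conv2d(3, width, kernel_size=3, padding=1)
        self.stage1 = conv_unit(width, width)
        self.stage2 = conv_unit(width, width)
        self.stage3 = conv_unit(width, width, dilation=2)  # dilated stage
        self.stage4 = conv_unit(width, width, dilation=4)  # dilated stage
        self.readout = nn.Conv2d(4 * width, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.stage1(self.stem(x))
        f2 = self.stage2(F.max_pool2d(f1, 2))
        f3 = self.stage3(f2)  # dilation grows context without more pooling
        f4 = self.stage4(f3)
        size = f1.shape[-2:]
        fused = torch.cat(
            [f1] + [F.interpolate(f, size=size, mode='bilinear',
                                  align_corners=False) for f in (f2, f3, f4)],
            dim=1)
        return torch.sigmoid(self.readout(fused))  # single-channel saliency map
```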

Experiments

In this section, benchmark datasets, evaluation metrics, and implementation details are introduced first. Ablation experiments are then performed to analyze the contribution of each component to the model. Finally, qualitative and quantitative comparisons between the proposed model and state-of-the-art saliency models are provided.

Conclusions

In this paper, we have proposed a multiscale dilated dense convolutional network for saliency prediction, which encodes instance features of objects to deal well with scenes containing multiple objects, especially when the objects belong to the same semantic category. Dense connections enable the architecture to learn both inter- and intra-class information. Contextual information is obtained by introducing dilated convolution. Inter- and intra-class as well as contextual information are further combined with multiscale features through shortcut connections.

Declaration of Competing Interest

There is no conflict of interest in this work.

Acknowledgments

The work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61572387, 61632019, 61836008, 61672404, and in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant No. 61621005.

References (66)

  • D. Gao et al., Bottom-up saliency is a discriminant process.
  • D. Gao et al., On the plausibility of the discriminant center-surround hypothesis for visual saliency, J. Vis. (2008).
  • D.A. Klein et al., Center-surround divergence of feature statistics for salient object detection.
  • L. Duan et al., Visual saliency detection by spatially weighted dissimilarity.
  • M. Cerf, J. Harel, W. Einhaeuser, C. Koch, Predicting human gaze using low-level saliency combined with face detection, …
  • M. Cerf et al., Faces and text attract gaze independent of the task: experimental data and computer model, J. Vis. (2009).
  • T. Judd et al., Learning to predict where humans look.
  • A. Borji, Boosting bottom-up and top-down visual features for saliency estimation.
  • Y. LeCun et al., Deep learning, Nature (2015).
  • E. Vig et al., Large-scale optimization of hierarchical features for saliency prediction in natural images.
  • X. Huang et al., SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks.
  • J. Pan et al., Shallow and deep convolutional networks for saliency prediction.
  • S.S.S. Kruthiventi et al., DeepFix: a fully convolutional neural network for predicting human eye fixations, IEEE Trans. Image Process. (2017).
  • N. Liu, J. Han, A deep spatial contextual long-term recurrent convolutional network for saliency detection, IEEE Trans. …
  • M. Cornia et al., Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Process. (2018).
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition (Sep. 2014).
  • K. He et al., Deep residual learning for image recognition.
  • G. Huang et al., Densely connected convolutional networks.
  • M. Jiang et al., SALICON: saliency in context.
  • A. Borji, L. Itti, CAT2000: a large scale fixation dataset for boosting saliency research (May 2015).
  • X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-C. Woo, Convolutional LSTM network: a machine learning approach …
  • A. Borji et al., State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. (2013).
  • A. Borji, Saliency prediction in the deep learning era: an empirical investigation (Oct. 2018).
This paper has been recommended for acceptance by Zicheng Liu.