Neurocomputing

Volume 364, 28 October 2019, Pages 310-321

Saliency detection via multi-level integration and multi-scale fusion neural networks

https://doi.org/10.1016/j.neucom.2019.07.054

Abstract

Recent advances in saliency models have remarkably improved performance, owing to the pervasive application of deep convolutional neural networks. However, for more challenging images, it is worthwhile to explore how to effectively exploit features at different levels and scales in deep convolutional neural networks for saliency detection. In this paper, we propose an end-to-end multi-level feature integration and multi-scale feature fusion network to better predict salient objects in challenging images. Specifically, our network first integrates multi-level features, from high level to low level, in a ResNet-based network. Then, the features combined by the multi-level feature integration network are further refined by four parallel residual connected blocks with dilated convolution, in which each block has a specific dilation rate to capture multi-scale context information. Finally, we fuse the outputs of the residual connected blocks with dilated convolution and obtain the saliency map by an up-sampling operation. Extensive experimental results demonstrate that the proposed model outperforms state-of-the-art saliency models on several challenging image datasets.

Introduction

Visual saliency detection, which aims at detecting the most distinctive regions or objects attracting human attention in an image, has received a lot of interest in recent years. It plays an important role in many computer vision tasks and has achieved tremendous success in a wide range of applications such as image retrieval [1], [2], image/video object detection [3], [4], [5], [6], [7], image/video object segmentation [8], [9], [10], image retargeting [11], [12] and semantic segmentation [13]. Despite considerable progress in recent years, saliency detection remains a challenging task, and current saliency detection methods usually fail in cluttered scenarios.

The first visual saliency model [14] was proposed decades ago. Since then, many models [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25] have been proposed to address saliency detection via hand-crafted features and heuristic saliency priors that imitate the human visual attention mechanism. Although these models have proved effective in some scenes, hand-crafted features and heuristic saliency priors are usually not sufficient to distinguish the high-level semantic concepts of salient objects from their surrounding background. Therefore, saliency models based on hand-crafted features cannot robustly identify salient objects against cluttered backgrounds.

In the past few years, deep convolutional neural networks (CNNs) have outperformed traditional methods in many computer vision tasks, e.g. image classification [26], [27], object detection/tracking [28], [29], [30], [31], [32], [33], [34], and especially many dense labeling tasks such as semantic segmentation [35], [36], instance segmentation [37], [38], pose estimation [39], [40] and contour detection [41]. Fully convolutional networks (FCNs), which capture high-level semantic information from raw images, have achieved outstanding performance. Inspired by the powerful capacity of FCNs to extract high-level features, several saliency models [42], [43], [44], [45], [46] based on FCNs suggest that a deep network can also benefit saliency detection and identify salient objects more accurately in a variety of scenes. In [42], a recurrent FCN is proposed to refine saliency priors and yield more accurate predictions. In [43], a pixel-level fully convolutional stream and a segment-level spatial stream are fused for detecting salient objects. In [44], a multi-task FCN-based saliency model is proposed to perform collaborative feature learning for saliency detection and semantic image segmentation. In [46], deep uncertain convolutional features are learned for saliency detection by introducing a reformulated dropout after specific convolution layers. In [45], multi-level convolutional features are aggregated to improve saliency detection. Although better performance has been achieved, most deep learning based saliency models have not fully exploited convolutional neural networks with deeper layers, which are able to represent more semantically refined and robust features. Besides, the saliency predictions are still unsatisfactory when these models are applied to images with cluttered scenes.

Recently, the residual network (ResNet) [27] has been shown to be an effective approach to training extremely deep neural network architectures for extracting rich semantic features. ResNet obtains better results through deeper networks and is used as the backbone architecture for many different tasks. Meanwhile, dilated convolution [36], [47], [48], which enlarges the receptive field of a convolution kernel without increasing the number of model parameters, captures spatial information more sufficiently. As for feature fusion across different layers of a deep convolutional network, the feature pyramid network (FPN) [49] exploits the inherent multi-scale, pyramidal hierarchy of a network to construct feature pyramids without extra cost. The feature pyramid is considered a basic component for detecting objects at different scales. FPN achieves more robust results than networks that exploit lateral/skip connections to associate low-level feature maps across resolutions and semantic levels. However, FPN may lose some details if only the prediction of its final layer is utilized for feature integration, and discriminative hierarchical features may be lost while semantic information propagates from high level to low level, because the integrated information is combined with the next lower-level feature maps directly. Moreover, the classical Atrous Spatial Pyramid Pooling (ASPP) module in [36], which includes four different dilated convolution layers, may lose some original spatial information in each dilated convolution.
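To make the parameter argument concrete, here is a minimal PyTorch sketch (our illustration, not from the paper) showing that a 3 × 3 convolution keeps the same weight count at any dilation rate, while setting padding equal to the dilation rate preserves the spatial resolution:

import torch
import torch.nn as nn

# Three 3x3 convolutions with identical parameter counts but growing
# receptive fields: the dilation rate widens the kernel's field of view
# without adding weights, and padding = dilation keeps the output size.
rates = (1, 2, 4)
convs = [nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d) for d in rates]

x = torch.randn(1, 64, 32, 32)
for d, conv in zip(rates, convs):
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"dilation={d}: output {tuple(conv(x).shape)}, params={n_params}")
# Every layer prints the same output shape (1, 64, 32, 32) and the same
# parameter count (64*64*9 + 64 = 36,928); only the receptive field differs.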

In this paper, motivated by FPN and dilated convolution, we propose an end-to-end ResNet-based deep convolutional neural network for saliency detection. Our model can effectively integrate high-level and low-level features. It can also effectively fuse multi-scale spatial features for generating high-quality saliency maps. The proposed saliency model overcomes the aforementioned limitations of recent saliency models based on convolutional neural networks. Specifically, the main contributions of our work lie in the following three aspects:

  • We propose a multi-level feature integration network by introducing residual connected blocks with dilated convolution. Different from the classical FPN [49] model, the proposed multi-level feature integration network uses residual connected blocks to aggregate and propagate high-level and low-level features, and expands the spatial receptive fields at different layers with dilated convolution to extract more robust features.

  • Different from the ASPP module [36] and the PDC module [32], which also exploit several convolutional layers with different dilation rates, we propose a multi-scale feature fusion network, which consists of four parallel residual connected blocks with different dilation rates and appends a 1 × 1 convolutional layer to the output of each residual block, to adaptively obtain robust spatial context information at different spatial scales and more accurate saliency detection (see the sketch after this list).

  • The proposed saliency model yields more accurate saliency maps and outperforms the state-of-the-art saliency models on five challenging saliency detection datasets, including DUTS [50], ECSSD [18], HKU-IS [51], PASCAL-S [52] and MSRA10K [53].
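To make the structure of the multi-scale feature fusion network concrete, the following PyTorch sketch (our illustration: the block layout, the dilation rates (1, 2, 4, 8), the channel width and the summation-based fusion are assumptions, not the paper's exact configuration) implements four parallel residual connected blocks with dilated convolution, each followed by a 1 × 1 convolutional layer:

import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    # A residual connected block built around dilated 3x3 convolutions.
    # The internal layer layout here is an assumption for illustration.
    def __init__(self, channels, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity shortcut preserves the original spatial information
        # that a plain stack of dilated convolutions could lose.
        return self.relu(x + self.body(x))

class MultiScaleFusion(nn.Module):
    # Four parallel dilated residual branches, each followed by a 1x1
    # convolution, fused by summation.
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(DilatedResidualBlock(channels, r),
                          nn.Conv2d(channels, channels, 1))
            for r in rates
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

x = torch.randn(1, 256, 20, 20)
print(MultiScaleFusion(256)(x).shape)  # torch.Size([1, 256, 20, 20])

Because every branch preserves spatial resolution, the branch outputs can be fused element-wise; each branch simply sees the same feature map at a different spatial scale of context.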

The rest of this paper is organized as follows. Related work on saliency detection and deep neural networks is briefly reviewed in Section 2. The network structure and training details of the proposed saliency model are described in Section 3. Experimental results and analysis are presented in Section 4, and conclusions are drawn in Section 5.


Classical saliency models

Saliency models are mainly used for human fixation prediction [54], [55], [56], [57] and salient object detection [17], [20]. The difference between these two applications is that the former aims at finding the locations attracting the attention of human observers, while the latter tries to completely highlight salient object regions and suppress the background. Over the past decades, existing saliency models have mainly exploited low-level hand-crafted features and conventional machine learning techniques.

Proposed saliency model

The proposed saliency model includes two components: 1) a multi-level feature integration network, which integrates high-level and low-level features through residual connected blocks with dilated convolution to obtain a richer feature representation; and 2) a multi-scale feature fusion network, which fuses the features integrated by the previous network using residual connected blocks with dilated convolution at different dilation rates and spatial scales.

The first component is designed for effectively integrating high-level semantic features with low-level detail features to obtain a richer feature representation.
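As a rough illustration of one integration step in such a top-down pathway, the following PyTorch sketch (our illustration: the IntegrationStage name, the 1 × 1 projection layers, the channel sizes and the dilation rate are assumptions, not the paper's exact configuration) upsamples a higher-level feature map, merges it with the lateral lower-level feature map, and refines the result with a residual connected dilated convolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntegrationStage(nn.Module):
    # One top-down integration step: upsample the higher-level feature,
    # merge it with the lateral lower-level feature, then refine with a
    # residual connected dilated convolution. All sizes are illustrative.
    def __init__(self, high_ch, low_ch, out_ch, dilation=2):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch, out_ch, 1)   # project higher level
        self.lateral = nn.Conv2d(low_ch, out_ch, 1)   # project lower level
        self.refine = nn.Conv2d(out_ch, out_ch, 3,
                                padding=dilation, dilation=dilation)

    def forward(self, high, low):
        high = F.interpolate(self.reduce(high), size=low.shape[-2:],
                             mode='bilinear', align_corners=False)
        merged = high + self.lateral(low)
        # The identity shortcut keeps the merged features intact while the
        # dilated convolution enlarges the refinement path's receptive field.
        return merged + self.refine(merged)

# Toy usage with ResNet-like feature maps (shapes made up for the example):
c5 = torch.randn(1, 2048, 8, 8)    # high-level, low-resolution feature
c4 = torch.randn(1, 1024, 16, 16)  # lower-level, higher-resolution feature
print(IntegrationStage(2048, 1024, 256)(c5, c4).shape)  # (1, 256, 16, 16)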

Experimental setup

Following the experimental setups of up-to-date saliency models [64], [65], we utilize the DUTS [50] training dataset, which is a commonly used public image saliency dataset, for training. For data augmentation, we apply random mirror reflection to the training images in different training iterations.
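A minimal sketch of this augmentation, assuming PIL images for the training image and its ground-truth mask (the helper name random_mirror is ours):

import random
from PIL import Image

def random_mirror(image: Image.Image, mask: Image.Image):
    # Flip the training image and its ground-truth mask together with
    # probability 0.5, so the pair stays spatially aligned.
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        mask = mask.transpose(Image.FLIP_LEFT_RIGHT)
    return image, mask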

To evaluate the performance of our saliency model, the following five public saliency detection datasets are used for testing.

DUTS [50]. It is the largest saliency detection benchmark dataset, containing 10,553 training images and 5,019 test images.

Conclusion

In this paper, we propose an end-to-end deep neural network with multi-level feature integration and multi-scale feature fusion for saliency detection without any pre-/post-processing. Firstly, the proposed multi-level feature integration (MLFI) network integrates low-level and high-level information in different layers. Then, the proposed multi-scale feature fusion (MSFF) network fuses multi-scale spatial context information of the integrated feature maps. Finally, the fused feature maps are up-sampled to generate the final saliency map.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61771301.


References (74)

  • A.M. Treisman et al., A feature-integration theory of attention, Cognit. Psychol. (1980)
  • Y. Gao et al., 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process. (2012)
  • J. He et al., Mobile product search with bag of hash bits and boundary reranking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • R. Shi et al., Region diversity maximization for salient object detection, IEEE Signal Process. Lett. (2012)
  • Y. Luo et al., Saliency density maximization for efficient visual objects discovery, IEEE Trans. Circuits Syst. Video Technol. (2011)
  • W. Wang et al., Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process. (2015)
  • W. Wang et al., Stereoscopic thumbnail creation via efficient stereo saliency detection, IEEE Trans. Vis. Comput. Graph. (2017)
  • F. Guo et al., Video saliency detection using object proposals, IEEE Trans. Cybern. (2018)
  • Z. Liu et al., Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut, IEEE Trans. Multimed. (2012)
  • L. Ye et al., Salient object segmentation via effective integration of saliency and objectness, IEEE Trans. Multimed. (2017)
  • W. Wang et al., Saliency-aware geodesic video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • Y. Ding et al., Importance filtering for image retargeting, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011)
  • J. Sun et al., Scale and object aware image retargeting for thumbnail browsing, Proceedings of the IEEE International Conference on Computer Vision (2011)
  • M. Donoser et al., Saliency driven total variation segmentation, Proceedings of the IEEE International Conference on Computer Vision (2009)
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • J. Harel et al., Graph-based visual saliency, Proceedings of the Advances in Neural Information Processing Systems (2007)
  • F. Perazzi et al., Saliency filters: contrast based filtering for salient region detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • H. Jiang et al., Salient object detection: a discriminative regional feature integration approach, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • Q. Yan et al., Hierarchical saliency detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • C. Yang et al., Saliency detection via graph-based manifold ranking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • Z. Liu et al., Saliency tree: a novel saliency detection framework, IEEE Trans. Image Process. (2014)
  • J. Yan et al., Visual saliency detection via sparsity pursuit, IEEE Signal Process. Lett. (2010)
  • X. Li et al., Saliency detection via dense and sparse reconstruction, Proceedings of the IEEE International Conference on Computer Vision (2013)
  • B. Jiang et al., Saliency detection via absorbing Markov chain, Proceedings of the IEEE International Conference on Computer Vision (2013)
  • C. Aytekin et al., Visual saliency by extended quantum cuts, Proceedings of the IEEE International Conference on Image Processing (2015)
  • W. Zhu et al., Saliency optimization from robust background detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Proceedings of the Advances in Neural Information Processing Systems (2012)
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • J. Dai et al., R-FCN: object detection via region-based fully convolutional networks, Proceedings of the Advances in Neural Information Processing Systems (2016)
  • W. Wang et al., Video salient object detection via fully convolutional networks, IEEE Trans. Image Process. (2018)
  • H. Song et al., Pyramid dilated deeper ConvLSTM for video salient object detection, Proceedings of the European Conference on Computer Vision (2018)
  • X. Lu et al., Deep regression tracking with shrinkage loss, Proceedings of the European Conference on Computer Vision (2018)
  • X. Dong et al., Quadruplet network with one-shot learning for fast visual object tracking, IEEE Trans. Image Process. (2019)
  • E. Shelhamer et al., Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • L. Chen et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (2018)

    Mengke Huang received the B.E. degree from Shanghai Normal University, Shanghai, China, in 2014. He is currently pursuing the Ph.D. degree at the School of Communication and Information Engineering, Shanghai University, Shanghai, China. His research interests include deep learning and image/video saliency detection.

    Zhi Liu received the B.E. and M.E. degrees from Tianjin University, Tianjin, China, and the Ph.D. degree from Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai, China, in 1999, 2002, and 2005, respectively. He is currently a Professor with the School of Communication and Information Engineering, Shanghai University, Shanghai, China. From Aug. 2012 to Aug. 2014, he was a Visiting Researcher with the SIROCCO Team, IRISA/INRIA-Rennes, France, with the support by EU FP7 Marie Curie Actions. He has published more than 170 refereed technical papers in international journals and conferences. His research interests include image/video processing, machine learning, computer vision and multimedia communication. He was a TPC member/session chair in ICIP 2017, PCM 2016, VCIP 2016, ICME 2014, WIAMIS 2013, etc. He co-organized special sessions on visual attention, saliency models, and applications at WIAMIS 2013 and ICME 2014. He is an area editor of Signal Processing: Image Communication and served as a guest editor for the special issue on Recent Advances in Saliency Models, Applications and Evaluations in Signal Processing: Image Communication. He is a senior member of IEEE.

    Linwei Ye received the B.E. degree from Hangzhou Dianzi University, Hangzhou, China, in 2013, the M.E. degree from Shanghai University, Shanghai, China, in 2016, and is currently working toward the Ph.D. degree in computer science at the University of Manitoba, Winnipeg, MB, Canada. His research interests include saliency model, salient object segmentation, and semantic segmentation.

    Xiaofei Zhou received the Ph.D. degree from Shanghai University, Shanghai, China, in 2018. He is currently a Lecturer at Institute of Information and Control, School of Automation, Hangzhou Dianzi University. His research interests include saliency detection, and image/video segmentation.

    Yang Wang received the B.Sc. degree from the Harbin Institute of Technology, Harbin, China, the M.Sc. degree from the University of Alberta, Edmonton, AB, Canada, and the Ph.D. degree from Simon Fraser University, Burnaby, BC, Canada, all in computer science. He was previously a NSERC Postdoc Fellow with the University of Illinois at Urbana-Champaign, Champaign, IL, USA. He is currently an Associate Professor of computer science with the University of Manitoba, Winnipeg, MB, Canada. His research interests include computer vision and machine learning.
