
Neurocomputing

Volume 445, 20 July 2021, Pages 35-49

Deep saliency detection via spatial-wise dilated convolutional attention

https://doi.org/10.1016/j.neucom.2021.02.061

Abstract

Saliency detection aims to highlight the regions that significantly attract human attention and stand out in an image. In recent years, deep learning-based saliency detection has achieved remarkable performance over conventional works while still facing major challenges in multi-feature fusion and in enlarging the receptive field. Current top-performing saliency detectors based on FCNs benefit from their powerful feature representations but suffer from high computational costs because they integrate multi-scale features without distinction. In this paper, we propose a novel and simple network, DCAM, built on an attention mechanism with dilated convolutions (DAM), which incorporates multi-scale features with an enlarged receptive field. Specifically, we apply DAM to guide each side output, selectively emphasizing the significant regions and thus efficiently enhancing the representational ability of each layer. Our spatial attention module locates the areas of the image that have the greatest impact and assigns them higher weights. Besides, we adopt an FPN to integrate features of adjacent layers and a CRF scheme to refine the saliency results. Experiments on five benchmark datasets demonstrate that the proposed approach performs favorably against five state-of-the-art methods at a fast speed (56 FPS on a single GPU).

Introduction

Along with the breakthrough of deep learning approaches and the rapid development of computer vision, recent works on saliency detection have achieved conspicuous success through neural networks. As an important part of image processing, salient object detection is one of the core problems in computer vision; its goal is to find the most conspicuous and informative regions or objects in an image, i.e., those that attract human attention. It has been widely applied as a pre-processing step to enhance the performance of a variety of computer vision tasks, such as image resizing [1], object segmentation [2], person re-identification [3], [4] and image retrieval [5].

To accomplish the task of saliency detection effectively, we need to distinguish objects from complex backgrounds and account for objects with obscure boundaries. Traditional methods [6], [7], [8], [9], [10] employed hand-crafted visual features, such as color, texture, and contrast, to detect salient objects in the input images. However, the lack of high-level semantic information restricts their ability to make salient objects stand out from the background in complex scenes. Furthermore, extracting hand-crafted features is time-consuming.

Recently, convolutional neural networks (CNNs) have been introduced into saliency detection to extract high-level semantic features from raw pixels; their rich representational power yields performance far superior to traditional methods. A visual comparison with a traditional method is shown in Fig. 1.

Currently, many state-of-the-art methods are based on the fully convolutional network (FCN) [13], [14], [15], [16], which mainly obtains significant information by combining the high-level features extracted from the last several convolutional layers. FCNs accept images of any size, which makes them more efficient than patch-based CNNs. In [13], a recurrent fully convolutional network is proposed in which the model automatically learns to optimize the saliency map by correcting its previous errors until the final prediction is produced at the last time step. Zhang et al. [14] improve the robustness and accuracy of saliency detection by learning uncertain convolutional features on top of an FCN with dilated convolutions. Wang et al. [16] propose capturing hierarchical saliency information, from deep layers carrying global saliency information to shallow layers with local saliency responses; the final prediction is achieved through the cooperation of these global and local predictions. In general, these methods stack multiple convolutional and pooling layers to gradually increase the size of the receptive field and obtain higher-level semantic information. To some extent, this solves the problem that conventional approaches fail to yield satisfactory results for images with complex scenarios.

Unfortunately, most FCN-based works fail to extract and fuse features efficiently. The above-mentioned methods generally integrate multi-level convolutional features without considering their different contributions to saliency detection, even though not all features are equally important and some even cause distraction. To address this problem, many works [17], [18], [19], [20], [21] introduce attention mechanisms to alleviate background distraction and thus highlight the foreground regions. Some adopt channel-wise attention to assign large weights to the channels that play an important role in saliency detection; some apply spatial attention to focus on high-response regions instead of treating all spatial positions equally; and others combine both.
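The two attention flavours can be contrasted in a short NumPy sketch on a C x H x W feature map. The parameter-free average-pool gates below are simplified stand-ins for the learned weighting layers used in the cited works, not the actual modules of any particular paper:

```python
import numpy as np

def channel_gate(feat):
    """Channel-wise attention: one sigmoid weight per channel,
    computed here from the channel's global average response."""
    w = 1.0 / (1.0 + np.exp(-feat.mean(axis=(1, 2))))   # shape (C,)
    return feat * w[:, None, None]

def spatial_gate(feat):
    """Spatial attention: one sigmoid weight per spatial position,
    computed here by averaging across channels."""
    w = 1.0 / (1.0 + np.exp(-feat.mean(axis=0)))        # shape (H, W)
    return feat * w[None, :, :]

feat = np.random.default_rng(0).standard_normal((4, 8, 8))  # C x H x W
print(channel_gate(feat).shape, spatial_gate(feat).shape)
```

Either gate leaves the feature shape unchanged and only rescales responses, which is why the two can also be applied in sequence, as some of the cited works do.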

However, in these excellent spatial attention works, the extracted spatial features are still insufficient and can be further improved, so we refine the step of locating regions of interest with our spatial attention module. In addition, more lightweight models are needed so that saliency detection can better serve other computer vision tasks as a pre-processing step. In this paper, we therefore propose a novel deep hierarchical saliency detection network with a spatial-wise attention mechanism based on dilated convolutions (DAM). Specifically, in the construction of the spatial attention module, dilated convolution is applied to reduce the computational burden and increase the receptive field without introducing additional parameters. This structure provides a larger receptive field while removing the pooling layers, since pooling causes information loss: the repeated max-pooling and striding in the aforementioned FCNs and other networks significantly reduce the spatial resolution of the generated feature maps, and compensating with up-sampling or deconvolution costs additional memory and time. We therefore introduce dilated convolution to avoid down-sampling and retain the internal data structure. After concatenating the results of the dilated convolutions and applying a series of operations such as ReLU and normalization, the spatial attention weights are obtained and then multiplied by the preliminarily processed side outputs to produce the final saliency maps.
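The pipeline described above can be sketched in NumPy. The 3x3 kernels, the dilation rates (1, 2, 4) and the mean-fusion step below are illustrative stand-ins for the learned weights and the fusion layers of the actual network; only the structure (parallel dilated convolutions, ReLU, sigmoid weights, reweighting) follows the text:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Dilated 2-D convolution with 'same' zero padding on a single-channel map."""
    k = kernel.shape[0]                      # assume a square, odd-sized kernel
    pad = dilation * (k // 2)                # 'same' padding grows with dilation
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * xp[i * dilation:i * dilation + H,
                                     j * dilation:j * dilation + W]
    return out

def dilated_spatial_attention(feat, rates=(1, 2, 4), seed=0):
    """Parallel dilated convs -> ReLU -> fuse -> sigmoid weights -> reweight."""
    rng = np.random.default_rng(seed)
    branches = []
    for r in rates:                          # one branch per dilation rate
        kernel = 0.1 * rng.standard_normal((3, 3))   # stand-in for learned weights
        branches.append(np.maximum(dilated_conv2d(feat, kernel, r), 0))  # ReLU
    fused = np.stack(branches).mean(axis=0)  # stand-in for concat + 1x1 conv
    weights = 1.0 / (1.0 + np.exp(-fused))   # sigmoid -> weights in (0, 1)
    return weights * feat                    # reweighted side output

feat = np.random.default_rng(1).standard_normal((8, 8))
out = dilated_spatial_attention(feat)
print(out.shape)                             # same resolution: no pooling involved
```

Note that the output resolution equals the input resolution at every dilation rate, which is exactly the property that lets the module replace pooling without up-sampling afterwards.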

In addition, different layers serve different functions: the deep side outputs carry rich semantic information that helps locate prominent objects but suffer from missing details, while the shallow side outputs capture rich spatial information but lack global context. Simple fusion yields mediocre results, so in this paper we adopt an FPN [22] to purposefully select useful features for fusion and extract the most conspicuous, distinctive objects. Besides, we adopt a structure similar to the PPM [23] to extract global information, connecting a set of dilated convolutions in a denser way to cover a larger range of dilation rates.
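The FPN-style top-down step that passes high-level features down to low-level ones can be sketched minimally in NumPy. Nearest-neighbour upsampling stands in for the interpolation used in FPNs, and the lateral 1x1 convolution is omitted for brevity; the sizes below are illustrative:

```python
import numpy as np

def fpn_merge(deep, shallow):
    """One FPN top-down step: upsample the deeper (coarser) map 2x with
    nearest-neighbour interpolation, then add the lateral shallow feature."""
    up = deep.repeat(2, axis=0).repeat(2, axis=1)   # 2x nearest-neighbour upsample
    assert up.shape == shallow.shape, "maps must match after upsampling"
    return shallow + up

# merge a 2x2 high-level map into a 4x4 low-level one
deep = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
shallow = np.zeros((4, 4))
print(fpn_merge(deep, shallow))
```

Applied repeatedly from the deepest side output downwards, this step injects global semantics into each shallower, spatially richer feature map, which is the purposeful fusion the paragraph above argues for.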

In summary, our main contributions are threefold:

  • (1)

    A novel spatial-wise dilated convolutional attention network for saliency detection is proposed. We first preprocess every side output to enhance its representational ability. Then, we apply the attention module to guide each side output, selectively emphasizing the significant regions. We also improve the global feature extraction to better capture high-level semantic information. Besides, we introduce high-level features into the low-level features to locate objects precisely.

  • (2)

    We design a new spatial attention module to find the spatial positions in the image that contribute most to the salient results and to suppress the interference of background information. Moreover, the application of dilated convolution enlarges the receptive field and reduces the computation without introducing additional parameters. This module is versatile enough to be embedded in other computer vision tasks to highlight important regions.

  • (3)

    We design a lightweight model that runs fast, at 56 FPS on a single GPU, and outperforms other state-of-the-art models. Its efficient processing speed makes it well suited to serve other tasks as a pre-processing step. Comprehensive experimental comparisons against sixteen other advanced works and ablation analyses demonstrate the effectiveness of the proposed model on five benchmark datasets, including ECSSD, DUT-OMRON, HKU-IS, PASCAL-S and DUTS-TE.

The rest of this paper is organized as follows. In Section 2, we describe the salient object detection task and some excellent papers based on attention mechanisms in detail. Section 3 briefly describes the proposed method. In Section 4 and Section 5, we give the experimental results and conclusions.


Saliency detection

Salient object detection is the process of drawing a clear outline of the object from an image to find what the human eye is most interested in. There are two main categories of methods: traditional methods and deep learning-based methods.

Traditional methods adopt a lot of significant prior information for image saliency detection, such as background prior and center prior. Some works rely on hand-crafted features including color and shape in conjunction with heuristics computing model, which

Proposed approach

In this paper, we propose a dilated convolutional attention module that captures the locations of salient objects precisely while generating powerful attentive features. In Section 3.1, we describe the overall architecture of the proposed deep salient object detection network, and in Section 3.2 we give a detailed depiction of our attention module. The overall architecture of the proposed network is illustrated in Fig. 2.

Experiments

In this section, the superiority of the proposed model is demonstrated through experimental results. We first describe the datasets and evaluation criteria used in the experiments, and then present the experiments in detail. In the ablation study, we first evaluate the contribution of each module and then compare the full model with other advanced models.

Conclusion

In this paper, we propose a novel but simple network based on attention mechanisms with dilated convolutions (DAM). We design a novel convolutional network with a spatial attention module to find the spatial positions in the image that contribute most to the salient results and to suppress the interference of background information. Moreover, the use of dilated convolution enlarges the receptive field and reduces the computation without introducing additional parameters. Besides, the

CRediT authorship contribution statement

Wenzhao Cui: Conceptualization, Software, Formal analysis, Writing - original draft, Writing - review & editing. Qing Zhang: Methodology, Software, Writing - review & editing, Validation, Investigation, Resources, Visualization, Supervision, Project administration. Baochuan Zuo: Investigation, Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by Natural Science Foundation of Shanghai under Grant No. 19ZR1455300, National Natural Science Foundation of China under Grant No. 61806126 and Science and Technology Development Foundation of Shanghai Institute of Technology under Grant No. ZQ2018-23.

Wenzhao Cui obtained her Bachelor degree from Dongbei University of Finance and Economics (DUFE), Dalian, China, in 2018. She is currently a postgraduate student at Shanghai Institute of Technology, Shanghai, China. Her research interests include deep learning, object detection and saliency detection.

References (56)

  • C.A. Hussain et al.

    Robust pre-processing technique based on saliency detection for content based image retrieval systems

    Procedia Comput. Sci.

    (2016)
  • W. Wang et al.

    Stereoscopic thumbnail creation via efficient stereo saliency detection

    IEEE Trans. Vis. Comput. Graph.

    (2017)
  • W. Wang et al.

    Deep cropping via attention box prediction and aesthetics assessment

  • Y. Wei et al.

    STC: A simple to complex framework for weakly-supervised semantic segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • R. Zhao et al.

    Unsupervised salience learning for person re-identification

  • R. Zhao et al.

    Person re-identification by saliency learning

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • F. Perazzi, P. Krähenbühl, Y. Pritch, A. Hornung, Saliency filters: Contrast based filtering for salient region...
  • Y. Qin et al.

    Saliency detection via cellular automata

  • C. Yang et al.

    Saliency detection via graph-based manifold ranking

  • H. Jiang et al.

    Salient object detection: A discriminative regional feature integration approach

  • X. Li et al.

    Saliency detection via dense and sparse reconstruction

  • M. Feng et al.

    Attentive feedback network for boundary-aware salient object detection

  • J. Wang et al.

    Salient object detection: A discriminative regional feature integration approach

    Int. J. Comput. Vision

    (2017)
  • L. Wang et al.

    Salient object detection with recurrent fully convolutional networks

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • P. Zhang et al.

    Learning uncertain convolutional features for accurate saliency detection

  • W. Wang et al.

    An iterative and cooperative top-down and bottom-up inference network for salient object detection

  • W. Wang, J. Shen, Deep visual attention prediction, IEEE Transactions on Image...
  • N. Liu et al.

    PiCANet: Learning pixel-wise contextual attention for saliency detection

  • S. Woo et al.

    CBAM: Convolutional block attention module

  • W. Wang et al.

    Salient object detection with pyramid attention and salient edges

  • X. Zhang et al.

    Progressive attention guided recurrent network for salient object detection

  • J. Hu et al.

    Squeeze-and-excitation networks

  • T.-Y. Lin et al.

    Feature pyramid networks for object detection

  • H. Zhao et al.

    Pyramid scene parsing network

  • F. Guo, W. Wang, J. Shen, L. Shao et al., Video saliency detection using object proposals,...
  • S. Zhao, Z. Lei, M. Sun, J. Shen, A. Ma, Diffusion-based saliency detection with optimal...
  • W. Wang, J. Shen, J. Xie, M.M. Cheng, A. Borji, Revisiting video saliency prediction in the deep learning era, IEEE...
  • Q. Hou et al.

    Deeply supervised salient object detection with short connections


    Qing Zhang received her B.S. and Ph.D. degrees from the College of Electrical Engineering, East China University of Science and Technology, Shanghai, China, in 2007 and 2012, respectively. She is currently an associate professor at Shanghai Institute of Technology, Shanghai, China. Her research interests include image and video processing and machine learning.

    Baochuan Zuo received his Bachelor degree from Henan University of Science and Technology, China, in 2016. He is currently a second-year graduate student at Shanghai Institute of Technology. His current research interests include salient object detection and instance segmentation.
