Multi-attention guided feature fusion network for salient object detection
Introduction
Saliency detection refers to extracting salient regions from images with algorithms that simulate the human visual system. The task is divided into two branches: eye-fixation detection [[1], [2], [3], [4], [5]] and salient object segmentation [6], [7], [8], [9], [10], [11], [12], [13], [14]. In this paper, we focus on the latter branch, with the purpose of separating salient object areas from input images. The results of this research usually serve as a pre-processing step in various computer vision tasks, such as video segmentation [15], visual tracking [16], image retrieval [17], thumbnail creation [18] and image captioning [19].
Owing to the importance of salient object detection, numerous methods have emerged over the past few decades. Conventional models [20], [21], deeply influenced by the algorithm proposed by Itti et al. [22], usually utilize hand-crafted features to compute contrast between local and global regions. However, it is clearly difficult to segment salient objects from complex scenes using such simple low-level features as color and intensity.
Recently, substantial progress has been made in computer vision with the introduction of Convolutional Neural Networks (CNNs, e.g. VGG [25] and ResNet [26]). CNN-based methods, which can extract complex features carrying high-level semantic cues and low-level spatial structures simultaneously, are more feasible and effective than traditional algorithms. Even so, the repeated pooling operations in CNNs inevitably cause a loss of spatial detail that cannot be recovered by upsampling, which harms dense prediction tasks. To address this problem, multi-scale feature aggregation mechanisms [23], [24] have been used to enhance detailed information and capture distinctive objectness. However, the result of simple skip and short connections is not quite satisfactory (see Fig. 1), because different features have different impacts on predicting salient pixels; in fact, some cluttered and noisy features may cause interference.
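The simple skip-connection fusion criticized above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature shapes, the nearest-neighbour upsampling and the plain channel concatenation are all assumptions for exposition.

```python
import numpy as np

def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def skip_fuse(shallow, deep):
    """Naive skip connection: upsample the deep (C2, H/2, W/2) map and
    concatenate it with the shallow (C1, H, W) map along channels.
    Every channel is kept with equal weight, noisy or not."""
    up = upsample_nearest(deep, 2)
    return np.concatenate([shallow, up], axis=0)

rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 32, 32))    # low-level: fine spatial detail
deep    = rng.standard_normal((128, 16, 16))   # high-level: coarse semantics
fused   = skip_fuse(shallow, deep)             # (192, 32, 32)
```

Because no channel is reweighted, cluttered low-level responses pass into the decoder unchanged, which is exactly the interference the attention blocks below are designed to suppress.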
Therefore, to obtain an optimal and robust fused feature representation for more precise prediction, we want the network to select discriminative features and discard noisy ones automatically. The attention mechanism [27], [28], which assigns weights to image features at different positions and channels, has been proposed for this purpose and has benefited many computer vision tasks [29], [30], [31], [32]. Motivated by this, we apply multiple attention mechanisms in this paper to guide the message passing block by block. Different from the work [33] proposed in 2018, we use a novel Channel-wise Attention Block (CAB), which governs the information transmission between every two contiguous blocks to learn better aggregated features. Besides, we also employ self-attention and spatial attention to improve the integrated features in the spatial dimension.
More specifically, our motivation is to solve two challenging problems in salient object detection via attention mechanisms. The first is how to preserve the spatial consistency of the salient object. As shown in the first row of Fig. 1, inconsistency within the salient area troubles many saliency methods, which may miss parts of the target object. To tackle this issue, we construct a CAB-based encoder-decoder network that learns a more robust fused feature representation, for two reasons. For one thing, in the CAB module we concatenate the features output by every two adjacent convolutional blocks, then employ the semantic information of the higher block to compute channel-wise weights for the lower block from a global perspective. Accordingly, the semantic cues in the deeper block guide the shallower block to select more discriminative features, which strengthens the ability to segment the whole object. For another, the inconsistency problem is also caused by the lack of sufficient context information, so we integrate multi-scale features in the decoder subnet to capture both global and local context.
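The channel-guidance idea above (global pooling of two adjacent blocks, then per-channel weights for the lower block) can be sketched in NumPy as follows. This is a hypothetical illustration, not the paper's CAB: the bottleneck width and the weight matrices `w1`, `w2` are assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(low, high, w1, w2):
    """CAB-style sketch: a global descriptor of the concatenated low/high
    blocks drives a two-layer bottleneck that outputs one gate per
    low-level channel, so deep semantics select shallow features."""
    # Global average pooling -> (C_low + C_high,) descriptor
    desc = np.concatenate([low.mean(axis=(1, 2)), high.mean(axis=(1, 2))])
    gate = sigmoid(w2 @ np.maximum(w1 @ desc, 0.0))   # (C_low,) gates in (0, 1)
    return low * gate[:, None, None]                  # reweight low-level channels

rng = np.random.default_rng(0)
low  = rng.standard_normal((64, 32, 32))      # shallow block
high = rng.standard_normal((128, 16, 16))     # deep block
w1 = rng.standard_normal((48, 192)) * 0.1     # assumed bottleneck weights
w2 = rng.standard_normal((64, 48)) * 0.1
out = channel_attention(low, high, w1, w2)    # same shape as `low`
```

Since each gate lies in (0, 1), noisy low-level channels are attenuated rather than passed on unchanged, in contrast to the plain concatenation of simple skip connections.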
The second problem is how to prevent the network from predicting redundant background areas as salient (see the second row of Fig. 1). This issue mainly results from cluttered background features and a lack of contrastive context information. To alleviate it, we design a Position Attention Block (PAB) composed of a self-attention module and a spatial attention module. First, the self-attention module captures the relationship between every pair of pixels. For the feature vector at any spatial position, we calculate its similarity to the feature vectors at all other positions; these similarities weight every feature vector across all spatial locations, and the sum of the weighted vectors updates the feature vector at the original position. As a result, similar feature vectors reinforce each other regardless of their distance in the feature map, so the model can capture long-range dependencies and contextual information. Second, we apply the spatial attention module to highlight salient areas and suppress background positions. Since not all feature vectors contribute to saliency detection, and noisy background features may cause interference, the spatial attention module avoids distraction from non-salient regions and makes the features more distinctive.
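The two PAB steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's PAB: similarities are taken as plain dot products without learned projections, and the spatial gate is derived from pooled channel statistics.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat):
    """Self-attention sketch: every position is updated by a
    similarity-weighted sum over all positions, so similar vectors
    reinforce each other regardless of spatial distance."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N), N = H * W positions
    sim = softmax(x.T @ x, axis=1)          # (N, N) pairwise affinities, rows sum to 1
    out = x @ sim.T                         # position j <- sum_k sim[j, k] * x[:, k]
    return feat + out.reshape(c, h, w)      # residual update

def spatial_attention(feat):
    """Spatial-attention sketch: a per-position gate in (0, 1), computed
    here from the channel mean, highlights salient positions and
    suppresses background ones."""
    gate = 1.0 / (1.0 + np.exp(-feat.mean(axis=0)))  # (H, W)
    return feat * gate[None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((32, 8, 8))
refined = spatial_attention(self_attention(feat))
```

Note the quadratic (N, N) affinity matrix: the long-range modeling comes precisely from letting every position attend to every other one.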
In conclusion, the feature fusion network proposed in this paper performs superiorly under the guidance of the multi-attention mechanism. Our contributions are threefold:
- We propose an encoder-decoder feature aggregation network with a novel channel-wise attention block, which utilizes features in the high-level block to guide the selection of features in the low-level block. The multi-scale fusion features greatly benefit the spatial consistency of the salient object.
- We also use self-attention and spatial attention to capture long-range contextual information and make features more distinctive and effective.
- We test the model on five saliency benchmark datasets, and the experimental results validate the effectiveness of the proposed algorithm.
Related work
As a vital branch of dense prediction tasks, saliency detection has developed rapidly in recent decades. Early studies [34], [35], [36], [37], [38], [39], [40], [41], [42] concentrate on extracting hand-crafted features, such as color, intensity and some prior information. These methods, limited by the imperfection of low-level visual features and the prior knowledge of their designers, have poor accuracy and generalization. Due to the efficiency of deep learning approaches in computer vision tasks [43], [44],
Proposed method
In this section, we elaborate on the proposed network for the saliency task. First, we describe the backbone of the architecture. Then we focus on the channel-wise attention guided multi-scale feature fusion mechanism. Finally, we present the Position Attention Block (PAB), composed of a spatial attention module and a self-attention module, which filters features in the spatial dimension. As shown in Fig. 2, there are six side-output predictions in the whole network. We concatenate
Evaluation datasets
We evaluate the proposed network on five popular benchmark datasets: ECSSD [36], DUT-OMRON [35], HKU-IS [48], DUTS-test [66], SOD [67]. The ECSSD dataset has 1000 natural images with pixel-level annotations, and the images are selected from the internet. The DUT-OMRON dataset has 5168 complicated images with accurate ground truth, which is very challenging. The HKU-IS dataset has 4447 images which usually contain multiple disconnected salient objects. The DUTS dataset is a large-scale dataset
Conclusion
In this paper, we propose a novel feature fusion network for the saliency detection task, using three kinds of attention mechanisms to guide the integration and selection of features. To enhance the spatial consistency of salient object areas, we introduce a novel CAB module that exploits the semantic cues in the high-level block to guide the feature selection in the low-level block from a global view. Then we utilize spatial attention and self-attention to build the position attention module,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Anni Li: Writing - Original Draft. Jinqing Qi: Writing - Review & Editing. Huchuan Lu: Supervision.
Anni Li received her B.E. degree in electrical and information engineering from Dalian University of Technology (DUT), China, in 2017. She is currently a master's student in Signal and Information Processing at DUT. Her research interests include saliency detection and semantic segmentation.
References (72)
- Deep visual attention prediction, IEEE Trans. Image Process. (2017)
- Salicon: reducing the semantic gap in saliency prediction by adapting deep neural networks
- Shallow and deep convolutional networks for saliency prediction
- Understanding low- and high-level contributions to fixation prediction
- Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Process. (2018)
- A shape-based approach for salient object detection using deep learning
- Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation
- Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circ. Syst. Video Technol. (2014)
- Detect globally, refine locally: a novel approach to saliency detection
- Unsupervised salient object detection via inferring from imperfect saliency models, IEEE Trans. Multimedia (2017)
- Correspondence driven saliency transfer, IEEE Trans. Image Process.
- Advanced deep-learning techniques for salient and category-specific object detection: a survey, IEEE Signal Process. Mag.
- As-similar-as-possible saliency fusion, Multimedia Tools Appl.
- Semantic prior analysis for salient object detection, IEEE Trans. Image Process.
- Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
- Online tracking by learning discriminative saliency map with convolutional neural network
- 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process.
- Stereoscopic thumbnail creation via efficient stereo saliency detection, IEEE Trans. Visual. Comput. Graph.
- From captions to visual concepts and back
- Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circ. Syst. Video Technol.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell.
- Amulet: aggregating multi-level convolutional features for salient object detection
- Deeply supervised salient object detection with short connections
- Deep residual learning for image recognition
- End-to-end instance segmentation with recurrent attention
- Multi-context attention for human pose estimation
- Knowing when to look: adaptive attention via a visual sentinel for image captioning
- Multi-level attention networks for visual question answering
- Progressive attention guided recurrent network for salient object detection
- Salient object detection: a discriminative regional feature integration approach
- Saliency detection via graph-based manifold ranking
- Hierarchical saliency detection
Jinqing Qi received the Ph.D. degree in communication and integrated systems from the Tokyo Institute of Technology, Tokyo, Japan, in 2004. He is currently an Associate Professor of Information and Communication Engineering at Dalian University of Technology (DUT), Dalian, China. His recent research interests focus on computer vision, pattern recognition and machine learning. He is a member of IEEE.
Huchuan Lu received the M.S. degree from the Department of Electrical Engineering, Dalian University of Technology (DUT), China, in 1998, and the Ph.D. degree in System Engineering from DUT in 2008. Since 1998 he has been a faculty member of the School of Electronic and Information Engineering of DUT, and an associate professor since 2006. He visited Ritsumeikan University from Oct. 2007 to Jan. 2008. His recent research interests focus on computer vision, artificial intelligence, pattern recognition and machine learning. He is a member of IEEE and IEICE.