Multi-attention guided feature fusion network for salient object detection
Introduction
Saliency detection refers to extracting salient regions from images with algorithms that simulate the human visual system. The task is divided into two branches: eye-fixation detection [[1], [2], [3], [4], [5]] and salient object segmentation [6], [7], [8], [9], [10], [11], [12], [13], [14]. In this paper, we focus on the latter branch, with the purpose of separating salient object areas from input images. The results of this research usually serve as a pre-processing step in various computer vision tasks, such as video segmentation [15], visual tracking [16], image retrieval [17], thumbnail creation [18] and image captioning [19].
Owing to the importance of salient object detection, numerous methods have emerged over the past few decades. Conventional models [20], [21], deeply influenced by the algorithm proposed by Itti et al. [22], usually utilize hand-crafted features to compute contrast between local and global regions. However, it is clearly difficult to segment salient objects from complex scenes using such simple low-level features as color and intensity.
Recently, substantial progress has been made in computer vision with the introduction of Convolutional Neural Networks (CNNs, e.g. VGG [25] and ResNet [26]). CNN-based methods, which can extract complex features carrying high-level semantic cues and low-level spatial structures simultaneously, are more feasible and effective than traditional algorithms. Even so, the repeated pooling operations in CNNs inevitably cause a loss of spatial detail that cannot be recovered by upsampling, which harms dense prediction tasks. To address this problem, multi-scale feature aggregation mechanisms [23], [24] have been used to enhance detailed information and capture distinctive objectness. However, the result of simple skip and short connections is not quite satisfactory (see Fig. 1), because different features have different impacts on predicting salient pixels; in fact, some cluttered and noisy features may cause interference.
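The simple skip-connection fusion criticized above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature shapes, the nearest-neighbour upsampling and the plain channel concatenation are all assumptions for exposition.

```python
import numpy as np

def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def skip_fuse(shallow, deep):
    """Naive skip connection: upsample the deep (C2, H/2, W/2) map and
    concatenate it with the shallow (C1, H, W) map along channels.
    Every channel is kept with equal weight, noisy or not."""
    up = upsample_nearest(deep, 2)
    return np.concatenate([shallow, up], axis=0)

rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 32, 32))    # low-level: fine spatial detail
deep    = rng.standard_normal((128, 16, 16))   # high-level: coarse semantics
fused   = skip_fuse(shallow, deep)             # (192, 32, 32)
```

Because no channel is reweighted, cluttered low-level responses pass into the decoder unchanged, which is exactly the interference the attention blocks below are designed to suppress.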
Therefore, to obtain an optimal and robust fused feature representation for more precise prediction, we want the network to select discriminative features and discard noisy ones automatically. The attention mechanism [27], [28], which assigns weights to image features at different positions and channels, has been proposed for this purpose and has benefited many computer vision tasks [29], [30], [31], [32]. Motivated by this, we apply multiple attention mechanisms in this paper to guide the message passing block by block. Different from the work [33] proposed in 2018, we use a novel Channel-wise Attention Block (CAB), which governs the information transmission between every two contiguous blocks to learn better aggregated features. Besides, we also employ self-attention and spatial attention to improve the integrated features in the spatial dimension.
More specifically, our motivation is to solve two challenging problems in salient object detection via attention mechanisms. The first is how to preserve the spatial consistency of the salient object. As shown in the first row of Fig. 1, inconsistency within the salient area troubles many saliency methods, which may miss parts of the target object. To tackle this issue, we construct a CAB-based encoder-decoder network that learns a more robust fused feature representation, for two reasons. For one thing, in the CAB module we concatenate the features output by every two adjacent convolutional blocks, then employ the semantic information of the higher block to compute channel-wise weights for the lower block from a global perspective. Accordingly, the semantic cues in the deeper block guide the shallower block to select more discriminative features, which strengthens the ability to segment the whole object. For another, the inconsistency problem is also caused by the lack of sufficient context information, so we integrate multi-scale features in the decoder subnet to capture both global and local context.
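The channel-guidance idea above (global pooling of two adjacent blocks, then per-channel weights for the lower block) can be sketched in NumPy as follows. This is a hypothetical illustration, not the paper's CAB: the bottleneck width and the weight matrices `w1`, `w2` are assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(low, high, w1, w2):
    """CAB-style sketch: a global descriptor of the concatenated low/high
    blocks drives a two-layer bottleneck that outputs one gate per
    low-level channel, so deep semantics select shallow features."""
    # Global average pooling -> (C_low + C_high,) descriptor
    desc = np.concatenate([low.mean(axis=(1, 2)), high.mean(axis=(1, 2))])
    gate = sigmoid(w2 @ np.maximum(w1 @ desc, 0.0))   # (C_low,) gates in (0, 1)
    return low * gate[:, None, None]                  # reweight low-level channels

rng = np.random.default_rng(0)
low  = rng.standard_normal((64, 32, 32))      # shallow block
high = rng.standard_normal((128, 16, 16))     # deep block
w1 = rng.standard_normal((48, 192)) * 0.1     # assumed bottleneck weights
w2 = rng.standard_normal((64, 48)) * 0.1
out = channel_attention(low, high, w1, w2)    # same shape as `low`
```

Since each gate lies in (0, 1), noisy low-level channels are attenuated rather than passed on unchanged, in contrast to the plain concatenation of simple skip connections.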
The second problem is how to prevent the network from predicting redundant background areas as salient (see the second row of Fig. 1). This issue mainly results from cluttered background features and a lack of contrastive context information. To alleviate it, we design a Position Attention Block (PAB) composed of a self-attention module and a spatial attention module. First, the self-attention module captures the relationship between every pair of pixels. For the feature vector at any spatial position, we calculate its similarity to the feature vectors at all other positions; these similarities weight every feature vector across all spatial locations, and the sum of the weighted vectors updates the feature vector at the original position. As a result, similar feature vectors reinforce each other regardless of their distance in the feature map, so the model can capture long-range dependencies and contextual information. Second, we apply the spatial attention module to highlight salient areas and suppress background positions. Since not all feature vectors contribute to saliency detection, and noisy background features may cause interference, the spatial attention module avoids distraction from non-salient regions and makes the features more distinctive.
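The two PAB steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's PAB: similarities are taken as plain dot products without learned projections, and the spatial gate is derived from pooled channel statistics.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat):
    """Self-attention sketch: every position is updated by a
    similarity-weighted sum over all positions, so similar vectors
    reinforce each other regardless of spatial distance."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)              # (C, N), N = H * W positions
    sim = softmax(x.T @ x, axis=1)          # (N, N) pairwise affinities, rows sum to 1
    out = x @ sim.T                         # position j <- sum_k sim[j, k] * x[:, k]
    return feat + out.reshape(c, h, w)      # residual update

def spatial_attention(feat):
    """Spatial-attention sketch: a per-position gate in (0, 1), computed
    here from the channel mean, highlights salient positions and
    suppresses background ones."""
    gate = 1.0 / (1.0 + np.exp(-feat.mean(axis=0)))  # (H, W)
    return feat * gate[None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((32, 8, 8))
refined = spatial_attention(self_attention(feat))
```

Note the quadratic (N, N) affinity matrix: the long-range modeling comes precisely from letting every position attend to every other one.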
In conclusion, the feature fusion network proposed in this paper performs superiorly under the guidance of the multi-attention mechanism. Our contributions are threefold:
- We propose an encoder-decoder feature aggregation network with a novel channel-wise attention block, which utilizes features in the high-level block to guide the selection of features in the low-level block. The multi-scale fusion features greatly benefit the spatial consistency of the salient object.
- We also use self-attention and spatial attention to capture long-range contextual information and make features more distinctive and effective.
- We test the model on five saliency benchmark datasets, and the experimental results validate the effectiveness of the proposed algorithm.
Related work
As a vital branch of dense prediction tasks, saliency detection has developed rapidly in recent decades. Early studies [34], [35], [36], [37], [38], [39], [40], [41], [42] concentrate on extracting hand-crafted features, such as color, intensity and some prior information. These methods, limited by the imperfection of low-level visual features and the prior knowledge of their designers, have poor accuracy and generalization. Due to the efficiency of deep learning approaches in computer vision tasks [43], [44],
Proposed method
In this section, we elaborate on the proposed network for the saliency task. First, we describe the backbone of the architecture. Then we focus on the channel-wise attention guided multi-scale feature fusion mechanism. Finally, we present the Position Attention Block (PAB), composed of a spatial attention module and a self-attention module, which filters features in the spatial dimension. As shown in Fig. 2, there are six side-output predictions in the whole network. We concatenate
Evaluation datasets
We evaluate the proposed network on five popular benchmark datasets: ECSSD [36], DUT-OMRON [35], HKU-IS [48], DUTS-test [66], SOD [67]. The ECSSD dataset has 1000 natural images with pixel-level annotations, and the images are selected from the internet. The DUT-OMRON dataset has 5168 complicated images with accurate ground truth, which is very challenging. The HKU-IS dataset has 4447 images which usually contain multiple disconnected salient objects. The DUTS dataset is a large-scale dataset
Conclusion
In this paper, we propose a novel feature fusion network for the saliency detection task, using three kinds of attention mechanisms to guide the integration and selection of features. To enhance the spatial consistency of salient object areas, we introduce a novel CAB module that exploits the semantic cues in the high-level block to guide the feature selection in the low-level block from a global view. Then we utilize spatial attention and self-attention to build the position attention module,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Anni Li: Writing - Original Draft. Jinqing Qi: Writing - Review & Editing. Huchuan Lu: Supervision.
Anni Li received her B.E. degree in electrical and information engineering from Dalian University of Technology (DUT), China, in 2017. She is currently a master's student in Signal and Information Processing at DUT. Her research interests include saliency detection and semantic segmentation.
References (72)
- Deep visual attention prediction, IEEE Trans. Image Process. (2017)
- Salicon: reducing the semantic gap in saliency prediction by adapting deep neural networks
- Shallow and deep convolutional networks for saliency prediction
- Understanding low- and high-level contributions to fixation prediction
- Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Process. (2018)
- A shape-based approach for salient object detection using deep learning
- Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation
- Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circ. Syst. Video Technol. (2014)
- Detect globally, refine locally: a novel approach to saliency detection
- Unsupervised salient object detection via inferring from imperfect saliency models, IEEE Trans. Multimedia (2017)
- Correspondence driven saliency transfer, IEEE Trans. Image Process.
- Advanced deep-learning techniques for salient and category-specific object detection: a survey, IEEE Signal Process. Mag.
- As-similar-as-possible saliency fusion, Multimedia Tools Appl.
- Semantic prior analysis for salient object detection, IEEE Trans. Image Process.
- Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
- Online tracking by learning discriminative saliency map with convolutional neural network
- 3-D object retrieval and recognition with hypergraph analysis, IEEE Trans. Image Process.
- Stereoscopic thumbnail creation via efficient stereo saliency detection, IEEE Trans. Visual. Comput. Graph.
- From captions to visual concepts and back
- Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circ. Syst. Video Technol.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell.
- Amulet: aggregating multi-level convolutional features for salient object detection
- Deeply supervised salient object detection with short connections
- Deep residual learning for image recognition
- End-to-end instance segmentation with recurrent attention
- Multi-context attention for human pose estimation
- Knowing when to look: adaptive attention via a visual sentinel for image captioning
- Multi-level attention networks for visual question answering
- Progressive attention guided recurrent network for salient object detection
- Salient object detection: a discriminative regional feature integration approach
- Saliency detection via graph-based manifold ranking
- Hierarchical saliency detection
Jinqing Qi received the Ph.D. degree in communication and integrated systems from the Tokyo Institute of Technology, Tokyo, Japan, in 2004. He is currently an Associate Professor of Information and Communication Engineering at Dalian University of Technology (DUT), Dalian, China. His recent research interests focus on computer vision, pattern recognition and machine learning. He is a member of IEEE.
Huchuan Lu received the M.S. degree from the Department of Electrical Engineering, Dalian University of Technology (DUT), China, in 1998, and the Ph.D. degree in System Engineering from DUT in 2008. Since 1998 he has been a faculty member of the School of Electronic and Information Engineering of DUT, and an associate professor since 2006. He visited Ritsumeikan University from Oct. 2007 to Jan. 2008. His recent research interests focus on computer vision, artificial intelligence, pattern recognition and machine learning. He is a member of IEEE and IEICE.