1 Introduction

Multi-modal and multi-level feature fusion  [37] is essential for many computer vision tasks, such as object detection  [8, 21, 26, 42, 70], semantic segmentation  [29, 30, 32, 67], co-attention tasks  [19, 72] and classification  [38, 40, 53]. Here, we attempt to utilize this idea for RGB-D salient object detection (SOD)  [4, 74], which aims at finding and segmenting the most visually prominent object(s)  [2, 75] in a scene according to the RGB and depth cues.

Fig. 1.
figure 1

Saliency maps of state-of-the-art (SOTA) CNN-based methods (i.e., DMRA [52], CPFP [74], TANet [4], and our BBS-Net) and of methods based on hand-crafted features (i.e., SE [27] and LBE [24]). Our method generates higher-quality saliency maps and more effectively suppresses background distractors in challenging scenarios (first row: complex background; second row: noisy depth).

To efficiently integrate the RGB and depth cues for SOD, researchers have explored several multi-modal strategies  [3, 5], and have achieved encouraging results. Existing RGB-D SOD methods, however, still face the following challenges:

  1. (1)

    Effectively aggregating multi-level features. As discussed in [44, 63], teacher features provide discriminative semantic information that serves as strong guidance for locating salient objects, while student features carry abundant details that are beneficial for refining edges. Therefore, previous RGB-D SOD algorithms focus on leveraging multi-level features, either via a progressive merging process [47, 76] or by using a dedicated aggregation strategy [52, 74]. However, these operations directly fuse multi-level features without considering level-specific characteristics, and thus suffer from the inherent noise often introduced by low-level features [4, 65]. As a result, some methods tend to get distracted by the background (e.g., first row in Fig. 1).

  2. (2)

    Excavating informative cues from the depth modality. Previous methods combine RGB and depth cues by regarding the depth map as a fourth-channel input [13, 51], or by fusing the RGB and depth modalities through simple summation [22, 23] or multiplication [9, 78]. These algorithms treat depth and RGB information identically and ignore the fact that depth maps mainly encode the spatial relations among objects, whereas RGB information captures color and texture. Such simple combinations are therefore inefficient due to the modality difference. In addition, depth maps are sometimes of low quality, which may introduce noise and redundancy into the network. For example, the depth map shown in the second row of Fig. 1 is blurry and noisy, which is why many methods, including the top-ranked model (DMRA-iccv19 [52]), fail to detect the complete salient object.

Fig. 2.
figure 2

(a) Existing multi-level feature aggregation methods for RGB-D SOD  [3, 4, 47, 52, 62, 74, 76]. (b) In this paper, we propose to adopt a bifurcated backbone strategy (BBS) to split the multi-level features into student and teacher features. The initial saliency map \(S_1\) is utilized to refine the student features to effectively suppress distractors. Then, the refined features are passed to another cascaded decoder to generate the final saliency map \(S_2\).

To address these issues, we propose a novel Bifurcated Backbone Strategy Network (BBS-Net) for RGB-D salient object detection. As shown in Fig. 2(b), BBS-Net consists of two cascaded decoder stages. In the first stage, teacher features are aggregated by a standard cascaded decoder \(\mathbf{F} _{CD1}\) to generate an initial saliency map \(S_{1}\). In the second stage, student features are refined by an element-wise multiplication with the initial saliency map \(S_{1}\) and are then integrated by another cascaded decoder \(\mathbf{F} _{CD2}\) to predict the final map \(S_{2}\).

To the best of our knowledge, BBS-Net is the first work to explore the cascaded refinement mechanism for the RGB-D SOD task. Our main contributions are as follows:

  1. (1)

    We exploit multi-level features in a bifurcated backbone strategy (BBS) to suppress distractors in the lower layers. This strategy is based on the observation that high-level features provide discriminative semantic information without redundant details  [44, 65], which may contribute significantly to eliminating distractors in lower layers.

  2. (2)

    To fully capture the informative cues in the depth map and improve the compatibility of RGB and depth features, we introduce a depth-enhanced module (DEM), which contains two sequential attention mechanisms: channel attention and spatial attention. The channel attention exploits the inter-channel relations of the depth features, while the spatial attention identifies where the informative depth cues are located.

  3. (3)

    We demonstrate that the proposed BBS-Net outperforms 18 SOTA models on seven public datasets by a large margin. Our experiments show that our framework has strong scalability with respect to various backbones. This suggests that the bifurcated backbone strategy with a cascaded refinement mechanism is promising for multi-level and multi-modal learning tasks.

2 Related Works

Although RGB-based SOD has been thoroughly studied in recent years [7, 39, 60, 69, 71], most algorithms fail under complicated scenarios (e.g., cluttered backgrounds [16], low-intensity environments, or varying illumination) [4, 52]. As a complementary modality to RGB information, depth cues contain rich spatial distance information [52] and contribute significantly to understanding challenging scenes. Therefore, researchers have started to solve the SOD problem by combining RGB images with depth information [15].

Traditional Models. Previous RGB-D SOD algorithms mainly relied on hand-crafted features [9, 78]. Some of these methods largely depended on contrast-based cues, calculating color, edge, texture, and region contrast to measure saliency in a local region. For example, [15] adopted region-based contrast to compute contrast strengths for the segmented regions. In [10], the saliency value of each pixel depended on the color contrast and surface normals. However, such local contrast methods focused on the boundaries of salient objects and were easily affected by high-frequency content [54]. Therefore, some methods proposed to calculate saliency by combining local and global information, e.g., via global contrast [11], spatial priors [9], or background priors [56]. To effectively combine saliency cues from the RGB and depth modalities, researchers have explored various fusion strategies. Some methods [13, 51] regarded the depth image as a fourth input channel and processed the RGB and depth channels together (early fusion). This operation seems simple but disregards the differences between the RGB and depth modalities and thus cannot achieve reliable results. Therefore, to extract saliency information from the two modalities separately, some algorithms [22, 78] first leveraged two backbones to predict saliency maps and then fused the results (late fusion). Besides, considering that the RGB and depth modalities may positively influence each other, other methods [24, 34] fused RGB and depth features at a middle stage and then predicted the saliency map from the fused features (middle fusion). In fact, these three fusion strategies are also explored in current deep models, and our model can be considered a middle-fusion approach.

Deep Models. Early deep algorithms [54, 56] first extracted hand-crafted features and then fed them to CNNs to compute saliency confidence scores. However, these methods require designing low-level features first and cannot be trained in an end-to-end manner. More recently, researchers have exploited CNNs to extract RGB and depth features in a bottom-up way [28]. Compared with hand-crafted features, deep features contain more semantic and contextual information, can better represent the RGB and depth modalities, and achieve encouraging performance. The success of these deep models [5, 52] stems from two aspects of feature fusion. The first is the extraction of multi-scale features from different layers and their effective fusion. The second is the mechanism for fusing features from the two different modalities.

To effectively aggregate multi-scale features, researchers have designed various network architectures. For example, [47] fed a four-channel RGB-D image into a single backbone and then obtained saliency predictions from each side-output feature (single stream). Chen et al. [3] leveraged two networks to extract RGB and depth features, respectively, and then fused them in a progressive complementary way (double stream). Further, to exploit cross-modal complements in the bottom-up feature extraction process, Chen et al. [4] proposed a three-stream network that contains two modality-specific streams and a parallel cross-modal distillation stream to learn supplementary features (three streams). However, depth maps are often of low quality and thus may contain a lot of noise and misleading information, which greatly decreases the performance of SOD models. To address this problem, Zhao et al. [74] designed a contrast-enhanced network to improve the quality of depth maps using a contrast prior. Fan et al. [20] proposed a depth depurator unit that evaluates the quality of depth images and automatically filters out low-quality maps. Recent works have also explored uncertainty [68], bilateral attention [73], graph neural networks [48], and a joint learning strategy [25], achieving good performance.

Fig. 3.
figure 3

The architecture of BBS-Net. Feature Extraction: ‘Conv1\(\sim \)Conv5’ denote different layers from ResNet-50 [31]. Multi-level features (\(f_1^d\sim f_5^d\)) from the depth branch are enhanced by the (a) DEM and then fused with features (i.e., \(f_1^{rgb}\sim f_5^{rgb}\)) from the RGB branch. Stage 1: cross-modal teacher features (\(f_3^{cm}\sim f_5^{cm}\)) are first aggregated by the (b) cascaded decoder to produce the initial saliency map \(S_1\). Stage 2: student features (\(f_1^{cm}\sim f_3^{cm}\)) are then refined by the initial saliency map \(S_1\) and integrated by another cascaded decoder to predict the final saliency map \(S_2\).

3 Proposed Method

3.1 Overview

Existing popular RGB-D SOD models directly aggregate multi-level features (Fig. 2(a)). As shown in Fig. 3, the network flow of our BBS-Net is different from the above mentioned models. We first introduce the bifurcated backbone strategy with the cascaded refinement mechanism in Sect. 3.2. To fully use informative cues in the depth map, we introduce a new depth-enhanced module (Sect. 3.3).

3.2 Bifurcated Backbone Strategy (BBS)

We propose to excavate the rich semantic information in high-level cross-modal features to suppress background distractors in a cascaded refinement manner. We adopt a bifurcated backbone strategy (BBS) to divide the multi-level cross-modal features into two groups, i.e., \(\mathbf{Q} _1\) = {Conv1, Conv2, Conv3} and \(\mathbf{Q} _2\) = {Conv3, Conv4, Conv5}, with Conv3 as the split point. Each group still preserves the original multi-scale information.

Cascaded Refinement Mechanism. To effectively leverage the features of the two groups, the whole network is trained with a cascaded refinement mechanism. This mechanism first produces an initial saliency map with three cross-modal teacher features (i.e., \(\mathbf{Q} _2\)) and then improves the details of the initial saliency map \(S_1\) with three cross-modal student features (i.e., \(\mathbf{Q} _1\)) refined by the initial saliency map itself. Using this mechanism, our model can iteratively refine the details in the low-level features. This is based on the observation that high-level features contain rich global contextual information which is beneficial for locating salient objects, while low-level features carry much detailed micro-level information that can contribute significantly to refining the boundaries. In other words, this strategy efficiently eliminates noise in low-level cross-modal features, by exploring the specialties of multi-level features, and predicts the final saliency map in a progressive refinement manner.

Specifically, we first compute cross-modal features \(\{f_i^{cm}; i=1,2,...,5\}\) by merging the RGB and depth features processed by the DEM (Fig. 3(a)). In stage one, the three cross-modal teacher features (i.e., \(f_3^{cm},f_4^{cm},f_5^{cm}\)) are aggregated by the first cascaded decoder, which is formulated as:

$$\begin{aligned} S_1=\mathbf{T} _1\big (\mathbf{F} _{CD1}(f_3^{cm},f_4^{cm},f_5^{cm})\big ), \end{aligned}$$
(1)

where \(S_1\) is the initial saliency map, \(\mathbf{F} _{CD1}\) is the first cascaded decoder and \(\mathbf{T} _1\) represents two simple convolutional layers that change the channel number from 32 to 1. In stage two, the initial saliency map \(S_1\) is leveraged to refine the three cross-modal student features, which is defined as:

$$\begin{aligned} f_i^{cm^{\prime }}=f_i^{cm}\odot S_1 , \end{aligned}$$
(2)

where \(f_i^{cm^{\prime }}\) (\(i\in \{1,2,3\}\)) denotes the refined features and \(\odot \) represents the element-wise multiplication. Then, the three refined student features are integrated by another decoder followed by a progressively transposed module (PTM), which is defined as,

$$\begin{aligned} S_2=\mathbf{T} _2\Big (\mathbf{F} _{CD2}(f_1^{cm^{\prime }},f_2^{cm^{\prime }},f_3^{cm^{\prime }})\Big ), \end{aligned}$$
(3)

where \(S_2\) is the final saliency map. \(\mathbf{T} _2\) represents the PTM module and \(\mathbf{F} _{CD2}\) denotes the second cascaded decoder. Finally, we jointly optimize the two stages by defining the total loss:

$$\begin{aligned} \mathcal {L}=\alpha \ell _{ce}(S_1,G)+ (1-\alpha ) \ell _{ce}(S_2,G), \end{aligned}$$
(4)

in which \(\ell _{ce}\) represents the widely used binary cross entropy loss and \(\alpha \in [0,1]\) controls the trade-off between the two parts of the losses. The \(\ell _{ce}\) is computed as:

$$\begin{aligned} \ell _{ce}(S,G)=-\big (G\log S + (1-G)\log (1-S)\big ), \end{aligned}$$
(5)

in which S is the predicted saliency map and G denotes the ground-truth binary saliency map.
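
To make the two-stage flow concrete, the following PyTorch sketch traces Eqs. (1)-(5). It is a minimal illustration under assumptions, not the released implementation: the decoder and head modules (`cd1`, `t1`, `cd2`, `t2`) are assumed to exist, the sigmoid applied to \(S_1\) before the multiplication and the bilinear resizing of \(S_1\) to each student feature's resolution are our assumptions, and \(\alpha =0.5\) is an assumed value.

```python
import torch
import torch.nn.functional as F

def cascaded_refinement(f_cm, cd1, t1, cd2, t2):
    """Two-stage cascaded refinement (Eqs. 1-3).

    f_cm: list of five cross-modal features [f1, ..., f5];
    cd1, cd2: cascaded decoders; t1: two conv layers (32 -> 1); t2: the PTM.
    """
    # Stage 1: aggregate the teacher features (Conv3-Conv5) into the initial map S1.
    s1 = t1(cd1(f_cm[2], f_cm[3], f_cm[4]))                       # Eq. (1)

    # Stage 2: refine the student features (Conv1-Conv3) with S1, then decode again.
    refined = []
    for f in f_cm[:3]:
        s1_i = F.interpolate(s1, size=f.shape[2:], mode='bilinear',
                             align_corners=False)                 # resize S1 to f's scale (assumed)
        refined.append(f * torch.sigmoid(s1_i))                   # Eq. (2), element-wise product
    s2 = t2(cd2(*refined))                                        # Eq. (3)
    return s1, s2

def total_loss(s1, s2, gt, alpha=0.5):
    """Joint objective of Eqs. (4)-(5); assumes s1/s2 are logits resized to the GT size."""
    return alpha * F.binary_cross_entropy_with_logits(s1, gt) \
        + (1 - alpha) * F.binary_cross_entropy_with_logits(s2, gt)
```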

Cascaded Decoder. Given the two groups of multi-level, cross-modal features (\(\{f_i^{cm},f_{i+1}^{cm},f_{i+2}^{cm}\},i\in \{1,3\}\)) fused by the RGB and depth features from different layers, we need to efficiently utilize the multi-level, multi-scale information in each group to carry out our cascaded refinement strategy. Thus, we introduce a light-weight cascaded decoder [65] to aggregate the two groups of multi-level, cross-modal features. As shown in Fig. 3(b), the cascaded decoder contains three global context modules (GCM) and a simple feature aggregation strategy. The GCM is refined from the RFB module [46] with an additional branch to enlarge the receptive field and a residual connection [31] to preserve the original information. Specifically, as illustrated in Fig. 3(c), the GCM module contains four parallel branches. For all of these branches, a \(1\times 1\) convolution is first applied to reduce the channel size to 32. For the \(k^{th}\) branch (\(k\in \{2,3,4\}\)), a convolution operation with a kernel size of \(2k-1\) and dilation rate of 1 is applied. This is followed by another 3 \(\times \) 3 convolution operation with a dilation rate of \(2k-1\). The goal here is to extract the global contextual information from the cross-modal features. Next, the outputs of the four branches are concatenated together and their channel number is reduced to 32 with a 1 \(\times \) 1 convolution operation. Finally, the concatenated features form a residual connection with the input feature. The outputs of the GCM modules in the two cascaded decoders are defined by:

$$\begin{aligned} f_i^{gcm}=\mathbf{F} _{GCM}(f_i^{cm}). \end{aligned}$$
(6)
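
A minimal sketch of the GCM layout described above is given below. It follows the stated kernel and dilation sizes, but the padding choices, the absence of normalization/activation layers, and the \(1\times 1\) projection on the residual path are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module sketch: four parallel branches, concatenation,
    channel reduction, and a residual connection (see Sect. 3.2)."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1)]           # branch 1: plain 1x1 reduction
        for k in (2, 3, 4):                                 # branches k = 2, 3, 4
            ks = dil = 2 * k - 1                            # kernel 2k-1, then dilation 2k-1
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.Conv2d(out_ch, out_ch, ks, padding=ks // 2),
                nn.Conv2d(out_ch, out_ch, 3, padding=dil, dilation=dil)))
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)        # reduce concatenated channels to 32
        self.skip = nn.Conv2d(in_ch, out_ch, 1)             # assumed projection for the residual

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return y + self.skip(x)                             # residual connection
```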

To further improve the representational ability of cross-modal features, we leverage a pyramid multiplication and concatenation feature aggregation strategy to integrate the cross-modal features (\(\{f_i^{gcm},f_{i+1}^{gcm},f_{i+2}^{gcm}\},i\in \{1,3\}\)). As shown in Fig. 3(b), first, each refined feature \(f_i^{gcm}\) is updated by multiplying it with all higher-level features:

$$\begin{aligned} f_i^{gcm^{\prime }}=f_i^{gcm} \odot \varPi _{k=i+1}^{k_{max}} Conv\Big (\mathbf{F} _{U P}\left( f_k^{gcm}\right) \Big ), \end{aligned}$$
(7)

where \(i \in \{1,2,3\}\) with \(k_{max}=3\), or \(i \in \{3,4,5\}\) with \(k_{max}=5\). \(Conv (\cdot )\) represents the standard 3 \(\times \) 3 convolution operation, and \(\mathbf{F} _{UP}\) denotes an upsampling operation applied when the features are not at the same scale. \(\odot \) represents the element-wise multiplication. Second, the updated features are aggregated by a progressive concatenation strategy to generate the output:

$$\begin{aligned} S=\mathbf{T} \Big (\big [f_k^{gcm^{\prime }};\, Conv\big (\mathbf{F} _{UP}\big (\big [f_{k+1}^{gcm^{\prime }};\, Conv\big (\mathbf{F} _{UP}(f_{k+2}^{gcm^{\prime }})\big )\big ]\big )\big )\big ]\Big ), \end{aligned}$$
(8)

where S is the generated saliency map, \(k\in \{1,3\}\), and \([x;y]\) denotes the concatenation of x and y. In the first stage, \(\mathbf{T} \) represents two sequential convolutional layers (\(\mathbf{T} _1\)), while in the second stage it denotes the PTM module (\(\mathbf{T} _2\)). The output (88 \(\times \) 88) of the second decoder is 1/4 of the ground-truth resolution (352 \(\times \) 352), so directly up-sampling the output to the ground-truth size would lose some details. To address this problem, we design a simple yet effective progressively transposed module (PTM, Fig. 3(d)) to predict the final saliency map (\(S_2\)) in a progressive upsampling way. It is composed of two sequential residual-based transposed blocks [33] and three sequential \(1\times 1\) convolutions. Each residual-based transposed block consists of a \(3\times 3\) convolution and a residual-based transposed convolution.
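
The pyramid multiplication and progressive concatenation of Eqs. (7)-(8) can be sketched as follows. This is an illustrative sketch only: the helper `up_to`, the per-step 3 \(\times \) 3 convolutions `convs_mul`/`convs_cat`, and their channel bookkeeping are hypothetical names and assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def up_to(x, ref):
    """F_UP: upsample x to the spatial size of a reference feature."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear', align_corners=False)

def aggregate(g, convs_mul, convs_cat):
    """Cascaded-decoder aggregation (Eqs. 7-8).

    g: three GCM outputs [g_k, g_{k+1}, g_{k+2}] from low to high level;
    convs_mul, convs_cat: 3x3 conv layers (assumed, one per upsampling step).
    """
    # Eq. (7): multiply each feature with all of its upsampled higher-level features.
    updated = []
    for i, f in enumerate(g):
        out = f
        for k in range(i + 1, len(g)):
            out = out * convs_mul[k - 1](up_to(g[k], f))
        updated.append(out)
    # Eq. (8): progressive top-down concatenation toward the lowest level.
    inner = torch.cat([updated[1], convs_cat[0](up_to(updated[2], updated[1]))], dim=1)
    outer = torch.cat([updated[0], convs_cat[1](up_to(inner, updated[0]))], dim=1)
    return outer   # fed to T (two convs in stage one, the PTM in stage two)
```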

Note that our cascaded refinement mechanism differs from recent refinement mechanisms such as R3Net [14], CRN [6], and RFCN [61] in its usage of multi-level features and the initial map. The key difference and advantage of our design is that we only need a single round of saliency refinement to obtain good performance, whereas R3Net, CRN, and RFCN all need more iterations, which increases training time and computational cost. In addition, our cascaded strategy differs from CPD [65] in that it simultaneously exploits the details in low-level features and the semantic information in high-level features, while suppressing the noise in the low-level features.

3.3 Depth-Enhanced Module (DEM)

There are two main problems when trying to fuse RGB and depth features. One is the compatibility of the two due to the intrinsic modality difference, and the other is the redundancy and noise in low-quality depth features. Inspired by  [64], we introduce a depth-enhanced module (DEM) to improve the compatibility of multi-modal features and to excavate the informative cues from the depth features.

Specifically, let \(f_i^{rgb}\) and \(f_i^{d}\) denote the feature maps of the \(i^{th}\) (\(i\in \{1,2,...,5\}\)) side-output layer from the RGB and depth branches, respectively. As shown in Fig. 3, a DEM is added before each side-output feature map of the depth branch to improve the compatibility of the depth features. Such a side-output process enhances the saliency representation of the depth features and preserves the multi-level, multi-scale information. The fusion process of the two modalities is formulated as:

$$\begin{aligned} f_i^{cm}=f_i^{rgb}+\mathbf{F} _{DEM}(f_i^{d}), \end{aligned}$$
(9)

where \(f_i^{cm}\) represents the cross-modal features of the \(i^{th}\) layer. As illustrated in Fig. 3(a), the DEM includes a sequential channel attention operation and a spatial attention operation. The operation of the DEM is defined as:

$$\begin{aligned} \mathbf{F} _{DEM}(f_i^{d})=\mathbf{S} _{att}\Big (\mathbf{C} _{att}(f^{d}_i)\Big ), \end{aligned}$$
(10)

where \(\mathbf{C} _{att}(\cdot )\) and \(\mathbf{S} _{att}(\cdot )\) denote the channel attention and spatial attention, respectively. More specifically,

$$\begin{aligned} \mathbf{C} _{att}(f)=\mathbf{M} \Big (\mathbf{P} _{max}(f)\Big )\otimes f, \end{aligned}$$
(11)

where \(\mathbf{P} _{max}(\cdot )\) represents the global max pooling operation over each feature map, f denotes the input feature map, \(\mathbf{M} (\cdot )\) is a two-layer perceptron, and \(\otimes \) denotes multiplication with dimension broadcasting. The spatial attention is implemented as:

$$\begin{aligned} \mathbf{S} _{att}(f)=Conv\Big (\mathbf {R}_{max}(f)\Big )\odot f, \end{aligned}$$
(12)

where \(\mathbf {R}_{max}(\cdot )\) represents the global max pooling operation applied at each spatial position of the feature map along the channel axis. Our depth-enhanced module differs from those of previous RGB-D models, which fuse the corresponding multi-level features from the RGB and depth branches by direct concatenation [3, 4, 76], enhance the depth map with a contrast prior [74], or process the multi-level depth features with a simple convolutional layer [52]. To the best of our knowledge, we are the first to introduce an attention mechanism to excavate informative cues from the depth features at multiple side-output layers. Our experiments (see Table 4 and Fig. 5) show the effectiveness of this approach in improving the compatibility of multi-modal features.

Moreover, the spatial and channel attention mechanisms are different from the operation proposed in  [64]. We only leverage a single global max pooling  [50] operation to excavate the most critical cues in the depth features and reduce the complexity of the module simultaneously, which is based on the intuition that SOD aims at finding the most important area in an image.
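
A minimal PyTorch sketch of the DEM and the fusion of Eq. (9) is given below, following Eqs. (10)-(12): a channel gate built from global max pooling and a two-layer perceptron, followed by a spatial gate built from a channel-wise max map and a convolution. The sigmoid gates, the reduction ratio, and the 7 \(\times \) 7 kernel are our assumptions.

```python
import torch.nn as nn

class DEM(nn.Module):
    """Depth-enhanced module sketch: channel attention then spatial attention."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # Two-layer perceptron M(.) applied to globally max-pooled depth features (Eq. 11).
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(ch // reduction, ch), nn.Sigmoid())
        # Convolution applied to the channel-wise max map (Eq. 12); 7x7 kernel is assumed.
        self.conv = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_d):
        b, c, _, _ = f_d.shape
        w_c = self.mlp(f_d.amax(dim=(2, 3))).view(b, c, 1, 1)   # P_max over each feature map
        f = f_d * w_c                                           # C_att with broadcasting (Eq. 11)
        w_s = self.conv(f.amax(dim=1, keepdim=True))            # R_max along the channel axis
        return f * w_s                                          # S_att (Eq. 12)

# Cross-modal fusion at each side-output level (Eq. 9):
#   f_cm_i = f_rgb_i + DEM(f_d_i)
```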

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate our method on seven challenging RGB-D SOD datasets, i.e., NJU2K [34], NLPR [51], STERE [49], SIP [20], DES [9], LFSD [41], and SSD [77].

Training/Testing. Following the same training settings as in [52, 74], we use 1,485 samples from the NJU2K dataset and 700 samples from the NLPR dataset as our training set. The remaining images in the NJU2K and NLPR datasets and the whole STERE, DES, LFSD, SSD, and SIP datasets are used for testing.

Evaluation Metrics. We adopt four widely used metrics, including S-measure (\(S_{\alpha }\)) [17], maximum E-measure (\(E_{\xi }\)) [18], maximum F-measure (\(F_{\beta }\)) [1], and mean absolute error (MAE). Evaluation code: http://dpfan.net/d3netbenchmark/.
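
For reference, a simplified NumPy sketch of two of the four metrics (MAE and max F-measure with \(\beta ^2=0.3\), following common SOD practice [1]) is shown below; S-measure and E-measure involve structural and enhanced-alignment terms and are omitted here. The official evaluation code at the link above remains the authoritative implementation.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary GT."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, n_thresh=255):
    """Max F-measure over uniformly sampled thresholds (beta^2 = 0.3)."""
    gt_bin = gt > 0.5
    scores = []
    for t in np.linspace(0, 1, n_thresh):
        bin_pred = pred >= t
        tp = np.logical_and(bin_pred, gt_bin).sum()
        prec = tp / (bin_pred.sum() + 1e-8)
        rec = tp / (gt_bin.sum() + 1e-8)
        scores.append((1 + beta2) * prec * rec / (beta2 * prec + rec + 1e-8))
    return max(scores)
```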

Contenders. We compare the proposed BBS-Net with ten models that use hand-crafted features [9, 12, 24, 27, 34, 43, 51, 55, 58, 78] and eight deep-learning-based models [3, 4, 5, 28, 52, 54, 62, 74]. We train and test the above models using their default settings, as proposed in the original papers. For models without released source code, we use their published results for comparison.

Inference Time. In terms of speed, BBS-Net runs at 14 fps and 48 fps on a single GTX 1080Ti with batch sizes of 1 and 10, respectively.

Implementation Details. We perform our experiments using the PyTorch [59] framework on a single 1080Ti GPU. Parameters of the backbone network (ResNet-50 [31]) are initialized from the model pre-trained on ImageNet [36]. We discard the last pooling and fully connected layers of ResNet-50 and use the intermediate output of each of the five convolutional blocks as the side-output feature maps. The two branches do not share weights, and the only difference between them is that the depth branch takes a single-channel input. The other parameters are initialized with the PyTorch default settings. We use the Adam algorithm [35] to optimize the proposed model. The initial learning rate is set to 1e−4 and is divided by 10 every 60 epochs. We resize the input RGB and depth images to \(352\times 352\) for both the training and test phases. All training images are augmented using multiple strategies (i.e., random flipping, rotation, and border clipping). It takes about 10 h to train our model with a mini-batch size of 10 for 200 epochs.
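
The stated optimization settings translate into roughly the following training skeleton. This is a sketch under assumptions: `model`, `train_loader`, and the `total_loss` helper from Sect. 3.2 are hypothetical names standing in for the actual implementation.

```python
import torch

def train(model, train_loader, epochs=200):
    """Training skeleton matching the stated settings; model and loader are assumed."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Divide the learning rate by 10 every 60 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    for epoch in range(epochs):
        for rgb, depth, gt in train_loader:           # inputs resized to 352 x 352
            s1, s2 = model(rgb, depth)                # initial and final saliency maps
            loss = total_loss(s1, s2, gt, alpha=0.5)  # joint loss of Eq. (4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```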

Table 1. Quantitative comparison of models using S-measure (\(S_{\alpha }\)), max F-measure (\(F_{\beta }\)), max E-measure (\(E_{\xi }\)) and MAE (M) scores on seven datasets. \(\uparrow \) (\(\downarrow \)) denotes that the higher (lower) the better. The best score in each row is highlighted in bold.

4.2 Comparison with SOTAs

Quantitative Results. As shown in Table 1, our method performs favorably against all algorithms based on hand-crafted features as well as SOTA CNN-based methods, by a large margin, in terms of all four evaluation metrics. Performance gains over the best compared algorithms (ICCV’19 DMRA [52] and CVPR’19 CPFP [74]) are (2.5%–3.5%, 0.7%–3.9%, 0.8%–2.3%, 0.009–0.016) for the metrics (\(S_{\alpha }\), \(max F_{\beta }\), \(max E_{\xi }\), M) on the seven challenging datasets.

Fig. 4.
figure 4

Qualitative visual comparison of the proposed model versus 8 SOTAs.

Visual Comparison. Figure 4 provides sample saliency maps predicted by the proposed method and several SOTA algorithms. Visualizations cover simple scenes (a) and various challenging scenarios, including small objects (b), multiple objects (c), complex backgrounds (d), and low-contrast scenes (e). First, (a) is an easy example. The flower in the foreground is evident in the original RGB image, but the depth image is low-quality and contains some misleading information. The top two models, i.e., DMRA and CPFP, fail to predict the whole extent of the salient object due to the interference from the depth map. Our method can eliminate the side-effects of the depth map by utilizing the complementary depth information more effectively. Second, two examples of small objects are shown in (b). Despite the handle of the teapot in the first row being tiny, our method can accurately detect it. Third, we show two examples with multiple objects in the image in (c). Our method locates all salient objects in the image. It segments the objects better and generates sharper edges compared to other algorithms. Even though the depth map in the first row of (c) lacks clear information, our algorithm predicts the salient objects correctly. Fourth, (d) shows two examples with complex backgrounds. Here, our method produces reliable results, while other algorithms confuse the background as a salient object. Finally, (e) presents two examples in which the contrast between the object and background is low. Many algorithms fail to detect and segment the entire extent of the salient object. Our method produces satisfactory results by suppressing background distractors and exploring the informative cues from the depth map.

Table 2. Performance comparison using different backbones.
Table 3. Comparison of different feature aggregation strategies on seven datasets.

5 Discussion

Scalability. There are three popular backbone architectures (i.e., VGG-16  [57], VGG-19  [57] and ResNet-50  [31]) that are used in deep RGB-D models. To further validate the scalability of the proposed method, we provide performance comparisons using different backbones in Table 2. We find that our BBS-Net exceeds the SOTA methods (e.g., CPFP  [74], and DMRA  [52]) with all of these popular backbones, showing the strong scalability of our framework.

Aggregation Strategies. We conduct several experiments to validate the effectiveness of our cascaded refinement mechanism. Results are shown in Table 3 and Fig. 5(a). ‘Low3’ means that we only integrate the student features (Conv1\(\sim \)3) using the decoder, without the refinement from the initial map, for training and testing. Student features contain abundant details that are beneficial for refining the object edges, but at the same time introduce a lot of background distraction. Integrating only low-level features thus produces unsatisfactory results, generating many distractors (e.g., the \(1^{st}\) and \(2^{nd}\) rows in Fig. 5(a)) or failing to locate the salient objects (e.g., the \(3^{rd}\) row in Fig. 5(a)). ‘High3’ only integrates the teacher features (Conv3\(\sim \)5) using the decoder to predict the saliency map. Compared with student features, teacher features are ‘sophisticated’ and thus contain more semantic information. As a result, they help locate the salient objects and preserve edge information, so integrating teacher features leads to better results. ‘All5’ aggregates features from all five levels (Conv1\(\sim \)5) directly using a single decoder for training and testing. It achieves results comparable to ‘High3’ but may generate background noise introduced by the student features. ‘BBS-NoRF’ indicates that we directly remove the refinement flow of our model, which leads to poor performance. ‘BBS-RH’ can be seen as the reverse of our cascaded refinement mechanism: teacher features (Conv3\(\sim \)5) are first refined by the initial map aggregated from student features (Conv1\(\sim \)3) and are then integrated to generate the final saliency map. It performs worse than our final mechanism (‘BBS-RL’) because, with this reverse refinement strategy, the noise in student features cannot be effectively suppressed. Besides, compared to ‘All5’, our method fully utilizes the features at different levels, and thus achieves a significant performance improvement with fewer background distractors and sharper edges (i.e., ‘BBS-RL’ in Fig. 5(a)).

Table 4. Ablation study of our BBS-Net. ‘BM’ = base model. ‘CA’ = channel attention. ‘SA’ = spatial attention. ‘PTM’ = progressively transposed module.
Table 5. Effectiveness analysis of the cascaded decoder.
Fig. 5.
figure 5

(a): Visual comparison of different aggregation strategies, (b): Visual effectiveness of gradually adding modules. ‘#’ denotes the corresponding row of Table 4.

Impact of Different Modules. As shown in Table 4 and Fig. 5(b), we conduct an ablation study to test the effectiveness of different modules in our BBS-Net. The base model (BM) is our BBS-Net without additional modules (i.e., CA, SA, and PTM). Note that just the BM performs better than the SOTA methods over almost all datasets, as shown in Table 1 and Table 4. Adding the channel attention (CA) and spatial attention (SA) modules enhances performance on most of the datasets. See the results shown in the second and third rows of Table 4. When we combine the two modules (fourth row in Table 4), the performance is greatly improved on all datasets, compared with the BM. We can easily conclude from the ‘#3’ and ‘#4’ columns in Fig. 5(b) that the spatial attention and channel attention mechanisms in DEM allow the model to focus on the informative parts of the depth features, which results in better suppression of background distraction. Finally, we add a progressively transposed block before the second decoder to gradually upsample the feature map to the same resolution as the ground truth. The results in the fifth row of Table 4 and the ‘#5’ column of Fig. 5(b) show that the ‘PTM’ achieves impressive performance gains on all datasets and generates sharper edges with fine details.

To further analyze the effectiveness of the cascaded decoder, we conduct an experiment that replaces the decoder with an element-wise summation mechanism. That is, we first transform the features from different layers to the same dimension using \(1\times 1\) convolutions and upsampling, and then fuse them by element-wise summation. Experimental results in Table 5 demonstrate the effectiveness of the cascaded decoder.
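
This summation baseline of Table 5 can be sketched as follows; the channel width and the choice of the largest resolution as the common size are assumptions on our part.

```python
import torch.nn as nn
import torch.nn.functional as F

class SumFusion(nn.Module):
    """Ablation baseline: 1x1 convs + upsampling + element-wise summation."""
    def __init__(self, in_chs, out_ch=32):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)

    def forward(self, feats):
        size = feats[0].shape[2:]             # resolution of the lowest-level feature (assumed)
        outs = [F.interpolate(r(f), size=size, mode='bilinear', align_corners=False)
                for r, f in zip(self.reduce, feats)]
        return sum(outs)                      # replaces the cascaded decoder's aggregation
```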

Table 6. \(S_{\alpha }\) comparison with SOTA RGB SOD methods on three datasets. ‘w/o D’ and ‘w/ D’ represent training and testing the proposed method without/with the depth.

Benefits of the Depth Map. To explore whether the depth information really contributes to SOD performance, we conduct two experiments in Table 6: (i) We compare the proposed method with five SOTA RGB SOD methods (i.e., CPD [66], PoolNet [44], PiCANet [45], PAGRN [71] and R3Net [14]) when neglecting the depth information. We train and test these methods using the same training and testing sets as our BBS-Net. The proposed method (i.e., Ours (w/ D)) significantly exceeds the SOTA RGB SOD methods due to the usage of depth information. (ii) We train and test the proposed method without using the depth information by setting the inputs of the depth branch to zero (i.e., Ours (w/o D)). Comparing the results of Ours (w/ D) with Ours (w/o D), we find that the depth information effectively improves the performance of the proposed model. Together, these two experiments demonstrate the benefit of depth information for SOD, since depth maps can be seen as prior knowledge that provides spatial-distance information and contour guidance for detecting salient objects.
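
For the ‘Ours (w/o D)’ setting, the depth branch simply receives an all-zero input; a minimal sketch of how such a variant can be run, assuming the model takes (rgb, depth) pairs, is:

```python
import torch

# Assumed interface: zeroing the depth input disables the depth cues at test time.
s1, s2 = model(rgb, torch.zeros_like(depth))
```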

6 Conclusion

We presented a new multi-level multi-modality learning framework that demonstrates state-of-the-art performance on seven challenging RGB-D salient object detection datasets using several evaluation measures. Our BBS-Net is based on a novel bifurcated backbone strategy (BBS) with a cascaded refinement mechanism. Importantly, our simple architecture is backbone independent, making it promising for further research on other related topics, including semantic segmentation, object detection and classification.