1 Introduction

Multi-modal and multi-level feature fusion  [37] is essential for many computer vision tasks, such as object detection  [8, 21, 26, 42, 70], semantic segmentation  [29, 30, 32, 67], co-attention tasks  [19, 72] and classification  [38, 40, 53]. Here, we attempt to utilize this idea for RGB-D salient object detection (SOD)  [4, 74], which aims at finding and segmenting the most visually prominent object(s)  [2, 75] in a scene according to the RGB and depth cues.

Fig. 1.
figure 1

Saliency maps of state-of-the-art (SOTA) CNN-based methods (i.e., DMRA [52], CPFP [74], TANet [4], and our BBS-Net) and of methods based on hand-crafted features (i.e., SE [27] and LBE [24]). Our method generates higher-quality saliency maps and more effectively suppresses background distractors in challenging scenarios (first row: complex background; second row: noisy depth).

To efficiently integrate the RGB and depth cues for SOD, researchers have explored several multi-modal strategies  [3, 5], and have achieved encouraging results. Existing RGB-D SOD methods, however, still face the following challenges:

  1. (1)

    Effectively aggregating multi-level features. As discussed in [44, 63], teacher features provide discriminative semantic information that serves as strong guidance for locating salient objects, while student features carry abundant details that are beneficial for refining edges. Therefore, previous RGB-D SOD algorithms focus on leveraging multi-level features, either via a progressive merging process [47, 76] or by using a dedicated aggregation strategy [52, 74]. However, these operations directly fuse multi-level features without considering level-specific characteristics, and thus suffer from the inherent noise often introduced by low-level features [4, 65]. As a result, some methods tend to get distracted by the background (e.g., first row in Fig. 1).

  2. (2)

    Excavating informative cues from the depth modality. Previous methods combine RGB and depth cues by regarding the depth map as a fourth-channel input [13, 51], or by fusing the RGB and depth modalities through simple summation [22, 23] or multiplication [9, 78]. These algorithms treat depth and RGB information identically and ignore the fact that depth maps mainly encode the spatial relations among objects, whereas RGB information captures color and texture. Such simple combinations are therefore inefficient due to the modality difference. In addition, depth maps are sometimes of low quality, which may introduce noise and redundancy into the network. For example, the depth map shown in the second row of Fig. 1 is blurry and noisy, which is why many methods, including the top-ranked model (DMRA-iccv19 [52]), fail to detect the complete salient object.

Fig. 2.
figure 2

(a) Existing multi-level feature aggregation methods for RGB-D SOD  [3, 4, 47, 52, 62, 74, 76]. (b) In this paper, we propose to adopt a bifurcated backbone strategy (BBS) to split the multi-level features into student and teacher features. The initial saliency map \(S_1\) is utilized to refine the student features to effectively suppress distractors. Then, the refined features are passed to another cascaded decoder to generate the final saliency map \(S_2\).

To address these issues, we propose a novel Bifurcated Backbone Strategy Network (BBS-Net) for RGB-D salient object detection. As shown in Fig. 2(b), BBS-Net consists of two cascaded decoder stages. In the first stage, teacher features are aggregated by a standard cascaded decoder \(\mathbf{F} _{CD1}\) to generate an initial saliency map \(S_{1}\). In the second stage, student features are refined by an element-wise multiplication with the initial saliency map \(S_{1}\) and are then integrated by another cascaded decoder \(\mathbf{F} _{CD2}\) to predict the final map \(S_{2}\).

To the best of our knowledge, BBS-Net is the first work to explore the cascaded refinement mechanism for the RGB-D SOD task. Our main contributions are as follows:

  1. (1)

    We exploit multi-level features in a bifurcated backbone strategy (BBS) to suppress distractors in the lower layers. This strategy is based on the observation that high-level features provide discriminative semantic information without redundant details  [44, 65], which may contribute significantly to eliminating distractors in lower layers.

  2. (2)

    To fully capture the informative cues in the depth map and improve the compatibility of RGB and depth features, we introduce a depth-enhanced module (DEM), which contains two sequential attention mechanisms: channel attention and spatial attention. The channel attention exploits the inter-channel relations of the depth features, while the spatial attention identifies where the informative depth cues are located.

  3. (3)

    We demonstrate that the proposed BBS-Net outperforms 18 SOTA models on seven public datasets by a large margin. Our experiments show that our framework has strong scalability with respect to various backbones. This suggests that the bifurcated backbone strategy with a cascaded refinement mechanism is promising for multi-level and multi-modal learning tasks.

2 Related Works

Although RGB-based SOD has been thoroughly studied in recent years [7, 39, 60, 69, 71], most algorithms fail under complicated scenarios (e.g., cluttered backgrounds [16], low-intensity environments, or varying illumination) [4, 52]. As a complementary modality to RGB information, depth cues contain rich spatial distance information [52] and contribute significantly to understanding challenging scenes. Therefore, researchers have started to solve the SOD problem by combining RGB images with depth information [15].

Traditional Models. Previous RGB-D SOD algorithms mainly relied on hand-crafted features [9, 78]. Some of these methods largely depended on contrast-based cues, calculating color, edge, texture, and region contrast to measure saliency in a local region. For example, [15] adopted region-based contrast to compute contrast strengths for the segmented regions. In [10], the saliency value of each pixel depended on the color contrast and surface normals. However, such local contrast methods focused on the boundaries of salient objects and were easily affected by high-frequency content [54]. Therefore, some methods proposed to calculate saliency by combining local and global information, e.g., via global contrast [11], spatial priors [9], or background priors [56]. To effectively combine saliency cues from the RGB and depth modalities, researchers have explored various fusion strategies. Some methods [13, 51] regarded the depth image as a fourth input channel and processed the RGB and depth channels together (early fusion). This operation seems simple but disregards the differences between the RGB and depth modalities and thus cannot achieve reliable results. Therefore, to extract saliency information from the two modalities separately, some algorithms [22, 78] first leveraged two backbones to predict saliency maps and then fused the results (late fusion). Besides, considering that the RGB and depth modalities may positively influence each other, other methods [24, 34] fused RGB and depth features at a middle stage and then predicted the saliency map from the fused features (middle fusion). In fact, these three fusion strategies are also explored in current deep models, and our model can be considered a middle-fusion approach.

Deep Models. Early deep algorithms [54, 56] first extracted hand-crafted features and then fed them to CNNs to compute saliency confidence scores. However, these methods require designing low-level features first and cannot be trained in an end-to-end manner. More recently, researchers have exploited CNNs to extract RGB and depth features in a bottom-up way [28]. Compared with hand-crafted features, deep features contain more semantic and contextual information, can better represent the RGB and depth modalities, and achieve encouraging performance. The success of these deep models [5, 52] stems from two aspects of feature fusion. The first is the extraction of multi-scale features from different layers and their effective fusion. The second is the mechanism for fusing features from the two different modalities.

To effectively aggregate multi-scale features, researchers have designed various network architectures. For example, [47] fed a four-channel RGB-D image into a single backbone and then obtained saliency predictions from each side-output feature (single stream). Chen et al. [3] leveraged two networks to extract RGB and depth features, respectively, and then fused them in a progressive complementary way (double stream). Further, to exploit cross-modal complements in the bottom-up feature extraction process, Chen et al. [4] proposed a three-stream network that contains two modality-specific streams and a parallel cross-modal distillation stream to learn supplementary features (three streams). However, depth maps are often of low quality and thus may contain a lot of noise and misleading information, which greatly decreases the performance of SOD models. To address this problem, Zhao et al. [74] designed a contrast-enhanced network to improve the quality of depth maps using a contrast prior. Fan et al. [20] proposed a depth depurator unit that evaluates the quality of depth images and automatically filters out low-quality maps. Recent works have also explored uncertainty [68], bilateral attention [73], graph neural networks [48], and a joint learning strategy [25], achieving good performance.

Fig. 3.
figure 3

The architecture of BBS-Net. Feature Extraction: ‘Conv1\(\sim \)Conv5’ denote different layers from ResNet-50 [31]. Multi-level features (\(f_1^d\sim f_5^d\)) from the depth branch are enhanced by the (a) DEM and then fused with features (i.e., \(f_1^{rgb}\sim f_5^{rgb}\)) from the RGB branch. Stage 1: cross-modal teacher features (\(f_3^{cm}\sim f_5^{cm}\)) are first aggregated by the (b) cascaded decoder to produce the initial saliency map \(S_1\). Stage 2: student features (\(f_1^{cm}\sim f_3^{cm}\)) are then refined by the initial saliency map \(S_1\) and integrated by another cascaded decoder to predict the final saliency map \(S_2\).

3 Proposed Method

3.1 Overview

Existing popular RGB-D SOD models directly aggregate multi-level features (Fig. 2(a)). As shown in Fig. 3, the network flow of our BBS-Net is different from the above mentioned models. We first introduce the bifurcated backbone strategy with the cascaded refinement mechanism in Sect. 3.2. To fully use informative cues in the depth map, we introduce a new depth-enhanced module (Sect. 3.3).

3.2 Bifurcated Backbone Strategy (BBS)

We propose to excavate the rich semantic information in high-level cross-modal features to suppress background distractors in a cascaded refinement manner. We adopt a bifurcated backbone strategy (BBS) to divide the multi-level cross-modal features into two groups, i.e., \(\mathbf{Q} _1\) = {Conv1, Conv2, Conv3} and \(\mathbf{Q} _2\) = {Conv3, Conv4, Conv5}, with Conv3 as the split point. Each group still preserves the original multi-scale information.

Cascaded Refinement Mechanism. To effectively leverage the features of the two groups, the whole network is trained with a cascaded refinement mechanism. This mechanism first produces an initial saliency map with three cross-modal teacher features (i.e., \(\mathbf{Q} _2\)) and then improves the details of the initial saliency map \(S_1\) with three cross-modal student features (i.e., \(\mathbf{Q} _1\)) refined by the initial saliency map itself. Using this mechanism, our model can iteratively refine the details in the low-level features. This is based on the observation that high-level features contain rich global contextual information which is beneficial for locating salient objects, while low-level features carry much detailed micro-level information that can contribute significantly to refining the boundaries. In other words, this strategy efficiently eliminates noise in low-level cross-modal features, by exploring the specialties of multi-level features, and predicts the final saliency map in a progressive refinement manner.

Specifically, we first compute cross-modal features \(\{f_i^{cm}; i=1,2,...,5\}\) by merging the RGB and depth features processed by the DEM (Fig. 3(a)). In stage one, the three cross-modal teacher features (i.e., \(f_3^{cm},f_4^{cm},f_5^{cm}\)) are aggregated by the first cascaded decoder, which is formulated as:

$$\begin{aligned} S_1=\mathbf{T} _1\big (\mathbf{F} _{CD1}(f_3^{cm},f_4^{cm},f_5^{cm})\big ), \end{aligned}$$
(1)

where \(S_1\) is the initial saliency map, \(\mathbf{F} _{CD1}\) is the first cascaded decoder and \(\mathbf{T} _1\) represents two simple convolutional layers that change the channel number from 32 to 1. In stage two, the initial saliency map \(S_1\) is leveraged to refine the three cross-modal student features, which is defined as:

$$\begin{aligned} f_i^{cm^{\prime }}=f_i^{cm}\odot S_1 , \end{aligned}$$
(2)

where \(f_i^{cm^{\prime }}\) (\(i\in \{1,2,3\}\)) denotes the refined features and \(\odot \) represents the element-wise multiplication. Then, the three refined student features are integrated by another decoder followed by a progressively transposed module (PTM), which is defined as,

$$\begin{aligned} S_2=\mathbf{T} _2\Big (\mathbf{F} _{CD2}(f_1^{cm^{\prime }},f_2^{cm^{\prime }},f_3^{cm^{\prime }})\Big ), \end{aligned}$$
(3)

where \(S_2\) is the final saliency map. \(\mathbf{T} _2\) represents the PTM module and \(\mathbf{F} _{CD2}\) denotes the second cascaded decoder. Finally, we jointly optimize the two stages by defining the total loss:

$$\begin{aligned} \mathcal {L}=\alpha \ell _{ce}(S_1,G)+ (1-\alpha ) \ell _{ce}(S_2,G), \end{aligned}$$
(4)

in which \(\ell _{ce}\) represents the widely used binary cross entropy loss and \(\alpha \in [0,1]\) controls the trade-off between the two parts of the losses. The \(\ell _{ce}\) is computed as:

$$\begin{aligned} \ell _{ce}(S,G)=-\big (G\log S + (1-G)\log (1-S)\big ), \end{aligned}$$
(5)

in which S is the predicted saliency map and G denotes the ground-truth binary saliency map.
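
To make the two-stage flow concrete, the following PyTorch sketch traces Eqs. (1)-(5). It is a minimal illustration under assumptions, not the released implementation: the decoder and head modules (`cd1`, `t1`, `cd2`, `t2`) are assumed to exist, the sigmoid applied to \(S_1\) before the multiplication and the bilinear resizing of \(S_1\) to each student feature's resolution are our assumptions, and \(\alpha =0.5\) is an assumed value.

```python
import torch
import torch.nn.functional as F

def cascaded_refinement(f_cm, cd1, t1, cd2, t2):
    """Two-stage cascaded refinement (Eqs. 1-3).

    f_cm: list of five cross-modal features [f1, ..., f5];
    cd1, cd2: cascaded decoders; t1: two conv layers (32 -> 1); t2: the PTM.
    """
    # Stage 1: aggregate the teacher features (Conv3-Conv5) into the initial map S1.
    s1 = t1(cd1(f_cm[2], f_cm[3], f_cm[4]))                       # Eq. (1)

    # Stage 2: refine the student features (Conv1-Conv3) with S1, then decode again.
    refined = []
    for f in f_cm[:3]:
        s1_i = F.interpolate(s1, size=f.shape[2:], mode='bilinear',
                             align_corners=False)                 # resize S1 to f's scale (assumed)
        refined.append(f * torch.sigmoid(s1_i))                   # Eq. (2), element-wise product
    s2 = t2(cd2(*refined))                                        # Eq. (3)
    return s1, s2

def total_loss(s1, s2, gt, alpha=0.5):
    """Joint objective of Eqs. (4)-(5); assumes s1/s2 are logits resized to the GT size."""
    return alpha * F.binary_cross_entropy_with_logits(s1, gt) \
        + (1 - alpha) * F.binary_cross_entropy_with_logits(s2, gt)
```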

Cascaded Decoder. Given the two groups of multi-level, cross-modal features (\(\{f_i^{cm},f_{i+1}^{cm},f_{i+2}^{cm}\},i\in \{1,3\}\)) fused by the RGB and depth features from different layers, we need to efficiently utilize the multi-level, multi-scale information in each group to carry out our cascaded refinement strategy. Thus, we introduce a light-weight cascaded decoder [65] to aggregate the two groups of multi-level, cross-modal features. As shown in Fig. 3(b), the cascaded decoder contains three global context modules (GCM) and a simple feature aggregation strategy. The GCM is refined from the RFB module [46] with an additional branch to enlarge the receptive field and a residual connection [31] to preserve the original information. Specifically, as illustrated in Fig. 3(c), the GCM module contains four parallel branches. For all of these branches, a \(1\times 1\) convolution is first applied to reduce the channel size to 32. For the \(k^{th}\) branch (\(k\in \{2,3,4\}\)), a convolution operation with a kernel size of \(2k-1\) and dilation rate of 1 is applied. This is followed by another 3 \(\times \) 3 convolution operation with a dilation rate of \(2k-1\). The goal here is to extract the global contextual information from the cross-modal features. Next, the outputs of the four branches are concatenated together and their channel number is reduced to 32 with a 1 \(\times \) 1 convolution operation. Finally, the concatenated features form a residual connection with the input feature. The outputs of the GCM modules in the two cascaded decoders are defined by:

$$\begin{aligned} f_i^{gcm}=\mathbf{F} _{GCM}(f_i^{cm}). \end{aligned}$$
(6)
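
A minimal sketch of the GCM layout described above is given below. It follows the stated kernel and dilation sizes, but the padding choices, the absence of normalization/activation layers, and the \(1\times 1\) projection on the residual path are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module sketch: four parallel branches, concatenation,
    channel reduction, and a residual connection (see Sect. 3.2)."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1)]           # branch 1: plain 1x1 reduction
        for k in (2, 3, 4):                                 # branches k = 2, 3, 4
            ks = dil = 2 * k - 1                            # kernel 2k-1, then dilation 2k-1
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.Conv2d(out_ch, out_ch, ks, padding=ks // 2),
                nn.Conv2d(out_ch, out_ch, 3, padding=dil, dilation=dil)))
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)        # reduce concatenated channels to 32
        self.skip = nn.Conv2d(in_ch, out_ch, 1)             # assumed projection for the residual

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return y + self.skip(x)                             # residual connection
```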

To further improve the representational ability of cross-modal features, we leverage a pyramid multiplication and concatenation feature aggregation strategy to integrate the cross-modal features (\(\{f_i^{gcm},f_{i+1}^{gcm},f_{i+2}^{gcm}\},i\in \{1,3\}\)). As shown in Fig. 3(b), first, each refined feature \(f_i^{gcm}\) is updated by multiplying it with all higher-level features:

$$\begin{aligned} f_i^{gcm^{\prime }}=f_i^{gcm} \odot \varPi _{k=i+1}^{k_{max}} Conv\Big (\mathbf{F} _{U P}\left( f_k^{gcm}\right) \Big ), \end{aligned}$$
(7)

where \(i \in \{1,2,3\}\) with \(k_{max}=3\), or \(i \in \{3,4,5\}\) with \(k_{max}=5\). \(Conv (\cdot )\) represents the standard 3 \(\times \) 3 convolution operation, and \(\mathbf{F} _{UP}\) denotes an upsampling operation applied when the features are not at the same scale. \(\odot \) represents the element-wise multiplication. Second, the updated features are aggregated by a progressive concatenation strategy to generate the output:

$$\begin{aligned} S=\mathbf{T} \Big (\big [f_k^{gcm^{\prime }};\, Conv\big (\mathbf{F} _{UP}\big (\big [f_{k+1}^{gcm^{\prime }};\, Conv\big (\mathbf{F} _{UP}(f_{k+2}^{gcm^{\prime }})\big )\big ]\big )\big )\big ]\Big ), \end{aligned}$$
(8)

where S is the generated saliency map, \(k\in \{1,3\}\), and \([x;y]\) denotes the concatenation of x and y. In the first stage, \(\mathbf{T} \) represents two sequential convolutional layers (\(\mathbf{T} _1\)), while in the second stage it denotes the PTM module (\(\mathbf{T} _2\)). The output (88 \(\times \) 88) of the second decoder is 1/4 of the ground-truth resolution (352 \(\times \) 352), so directly up-sampling the output to the ground-truth size would lose some details. To address this problem, we design a simple yet effective progressively transposed module (PTM, Fig. 3(d)) to predict the final saliency map (\(S_2\)) in a progressive upsampling way. It is composed of two sequential residual-based transposed blocks [33] and three sequential \(1\times 1\) convolutions. Each residual-based transposed block consists of a \(3\times 3\) convolution and a residual-based transposed convolution.
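
The pyramid multiplication and progressive concatenation of Eqs. (7)-(8) can be sketched as follows. This is an illustrative sketch only: the helper `up_to`, the per-step 3 \(\times \) 3 convolutions `convs_mul`/`convs_cat`, and their channel bookkeeping are hypothetical names and assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def up_to(x, ref):
    """F_UP: upsample x to the spatial size of a reference feature."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear', align_corners=False)

def aggregate(g, convs_mul, convs_cat):
    """Cascaded-decoder aggregation (Eqs. 7-8).

    g: three GCM outputs [g_k, g_{k+1}, g_{k+2}] from low to high level;
    convs_mul, convs_cat: 3x3 conv layers (assumed, one per upsampling step).
    """
    # Eq. (7): multiply each feature with all of its upsampled higher-level features.
    updated = []
    for i, f in enumerate(g):
        out = f
        for k in range(i + 1, len(g)):
            out = out * convs_mul[k - 1](up_to(g[k], f))
        updated.append(out)
    # Eq. (8): progressive top-down concatenation toward the lowest level.
    inner = torch.cat([updated[1], convs_cat[0](up_to(updated[2], updated[1]))], dim=1)
    outer = torch.cat([updated[0], convs_cat[1](up_to(inner, updated[0]))], dim=1)
    return outer   # fed to T (two convs in stage one, the PTM in stage two)
```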

Note that our cascaded refinement mechanism differs from recent refinement mechanisms such as R3Net [14], CRN [6], and RFCN [61] in its usage of multi-level features and the initial map. The key difference and advantage of our design is that we only need a single round of saliency refinement to obtain good performance, whereas R3Net, CRN, and RFCN all need more iterations, which increases training time and computational cost. In addition, our cascaded strategy differs from CPD [65] in that it simultaneously exploits the details in low-level features and the semantic information in high-level features, while suppressing the noise in the low-level features.

3.3 Depth-Enhanced Module (DEM)

There are two main problems when trying to fuse RGB and depth features. One is the compatibility of the two due to the intrinsic modality difference, and the other is the redundancy and noise in low-quality depth features. Inspired by  [64], we introduce a depth-enhanced module (DEM) to improve the compatibility of multi-modal features and to excavate the informative cues from the depth features.

Specifically, let \(f_i^{rgb}\) and \(f_i^{d}\) denote the feature maps of the \(i^{th}\) (\(i\in \{1,2,...,5\}\)) side-output layer from the RGB and depth branches, respectively. As shown in Fig. 3, a DEM is added before each side-output feature map of the depth branch to improve the compatibility of the depth features. Such a side-output process enhances the saliency representation of the depth features and preserves the multi-level, multi-scale information. The fusion process of the two modalities is formulated as:

$$\begin{aligned} f_i^{cm}=f_i^{rgb}+\mathbf{F} _{DEM}(f_i^{d}), \end{aligned}$$
(9)

where \(f_i^{cm}\) represents the cross-modal features of the \(i^{th}\) layer. As illustrated in Fig. 3(a), the DEM includes a sequential channel attention operation and a spatial attention operation. The operation of the DEM is defined as:

$$\begin{aligned} \mathbf{F} _{DEM}(f_i^{d})=\mathbf{S} _{att}\Big (\mathbf{C} _{att}(f^{d}_i)\Big ), \end{aligned}$$
(10)

where \(\mathbf{C} _{att}(\cdot )\) and \(\mathbf{S} _{att}(\cdot )\) denote the channel attention and spatial attention, respectively. More specifically,

$$\begin{aligned} \mathbf{C} _{att}(f)=\mathbf{M} \Big (\mathbf{P} _{max}(f)\Big )\otimes f, \end{aligned}$$
(11)

where \(\mathbf{P} _{max}(\cdot )\) represents the global max pooling operation over each feature map, f denotes the input feature map, \(\mathbf{M} (\cdot )\) is a two-layer perceptron, and \(\otimes \) denotes multiplication with dimension broadcasting. The spatial attention is implemented as:

$$\begin{aligned} \mathbf{S} _{att}(f)=Conv\Big (\mathbf {R}_{max}(f)\Big )\odot f, \end{aligned}$$
(12)

where \(\mathbf {R}_{max}(\cdot )\) represents the global max pooling operation applied at each spatial position of the feature map along the channel axis. Our depth-enhanced module differs from those of previous RGB-D models, which fuse the corresponding multi-level features from the RGB and depth branches by direct concatenation [3, 4, 76], enhance the depth map with a contrast prior [74], or process the multi-level depth features with a simple convolutional layer [52]. To the best of our knowledge, we are the first to introduce an attention mechanism to excavate informative cues from the depth features at multiple side-output layers. Our experiments (see Table 4 and Fig. 5) show the effectiveness of this approach in improving the compatibility of multi-modal features.

Moreover, the spatial and channel attention mechanisms are different from the operation proposed in  [64]. We only leverage a single global max pooling  [50] operation to excavate the most critical cues in the depth features and reduce the complexity of the module simultaneously, which is based on the intuition that SOD aims at finding the most important area in an image.
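
A minimal PyTorch sketch of the DEM and the fusion of Eq. (9) is given below, following Eqs. (10)-(12): a channel gate built from global max pooling and a two-layer perceptron, followed by a spatial gate built from a channel-wise max map and a convolution. The sigmoid gates, the reduction ratio, and the 7 \(\times \) 7 kernel are our assumptions.

```python
import torch.nn as nn

class DEM(nn.Module):
    """Depth-enhanced module sketch: channel attention then spatial attention."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        # Two-layer perceptron M(.) applied to globally max-pooled depth features (Eq. 11).
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(ch // reduction, ch), nn.Sigmoid())
        # Convolution applied to the channel-wise max map (Eq. 12); 7x7 kernel is assumed.
        self.conv = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_d):
        b, c, _, _ = f_d.shape
        w_c = self.mlp(f_d.amax(dim=(2, 3))).view(b, c, 1, 1)   # P_max over each feature map
        f = f_d * w_c                                           # C_att with broadcasting (Eq. 11)
        w_s = self.conv(f.amax(dim=1, keepdim=True))            # R_max along the channel axis
        return f * w_s                                          # S_att (Eq. 12)

# Cross-modal fusion at each side-output level (Eq. 9):
#   f_cm_i = f_rgb_i + DEM(f_d_i)
```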

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate our method on seven challenging RGB-D SOD datasets, i.e., NJU2K [34], NLPR [51], STERE [49], SIP [20], DES [9], LFSD [41], and SSD [77].

Training/Testing. Following the same training settings as in [52, 74], we use 1,485 samples from the NJU2K dataset and 700 samples from the NLPR dataset as our training set. The remaining images in the NJU2K and NLPR datasets and the whole STERE, DES, LFSD, SSD, and SIP datasets are used for testing.

Evaluation Metrics. We adopt four widely used metrics, including S-measure (\(S_{\alpha }\)) [17], maximum E-measure (\(E_{\xi }\)) [18], maximum F-measure (\(F_{\beta }\)) [1], and mean absolute error (MAE). Evaluation code: http://dpfan.net/d3netbenchmark/.
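
For reference, a simplified NumPy sketch of two of the four metrics (MAE and max F-measure with \(\beta ^2=0.3\), following common SOD practice [1]) is shown below; S-measure and E-measure involve structural and enhanced-alignment terms and are omitted here. The official evaluation code at the link above remains the authoritative implementation.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary GT."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, n_thresh=255):
    """Max F-measure over uniformly sampled thresholds (beta^2 = 0.3)."""
    gt_bin = gt > 0.5
    scores = []
    for t in np.linspace(0, 1, n_thresh):
        bin_pred = pred >= t
        tp = np.logical_and(bin_pred, gt_bin).sum()
        prec = tp / (bin_pred.sum() + 1e-8)
        rec = tp / (gt_bin.sum() + 1e-8)
        scores.append((1 + beta2) * prec * rec / (beta2 * prec + rec + 1e-8))
    return max(scores)
```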

Contenders. We compare the proposed BBS-Net with ten models that use hand-crafted features [9, 12, 24, 27, 34, 43, 51, 55, 58, 78] and eight deep-learning-based models [3, 4, 5, 28, 52, 54, 62, 74]. We train and test the above models using their default settings, as proposed in the original papers. For models without released source code, we use their published results for comparison.

Inference Time. In terms of speed, BBS-Net runs at 14 fps and 48 fps on a single GTX 1080Ti with batch sizes of 1 and 10, respectively.

Implementation Details. We perform our experiments using the PyTorch [59] framework on a single 1080Ti GPU. Parameters of the backbone network (ResNet-50 [31]) are initialized from the model pre-trained on ImageNet [36]. We discard the last pooling and fully connected layers of ResNet-50 and use the intermediate output of each of the five convolutional blocks as the side-output feature maps. The two branches do not share weights, and the only difference between them is that the depth branch takes a single-channel input. The other parameters are initialized with the PyTorch default settings. We use the Adam algorithm [35] to optimize the proposed model. The initial learning rate is set to 1e−4 and is divided by 10 every 60 epochs. We resize the input RGB and depth images to \(352\times 352\) for both the training and test phases. All training images are augmented using multiple strategies (i.e., random flipping, rotation, and border clipping). It takes about 10 h to train our model with a mini-batch size of 10 for 200 epochs.
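
The stated optimization settings translate into roughly the following training skeleton. This is a sketch under assumptions: `model`, `train_loader`, and the `total_loss` helper from Sect. 3.2 are hypothetical names standing in for the actual implementation.

```python
import torch

def train(model, train_loader, epochs=200):
    """Training skeleton matching the stated settings; model and loader are assumed."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Divide the learning rate by 10 every 60 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    for epoch in range(epochs):
        for rgb, depth, gt in train_loader:           # inputs resized to 352 x 352
            s1, s2 = model(rgb, depth)                # initial and final saliency maps
            loss = total_loss(s1, s2, gt, alpha=0.5)  # joint loss of Eq. (4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```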

Table 1. Quantitative comparison of models using S-measure (\(S_{\alpha }\)), max F-measure (\(F_{\beta }\)), max E-measure (\(E_{\xi }\)) and MAE (M) scores on seven datasets. \(\uparrow \) (\(\downarrow \)) denotes that the higher (lower) the better. The best score in each row is highlighted in bold.

4.2 Comparison with SOTAs

Quantitative Results. As shown in Table 1, our method performs favorably against all algorithms based on hand-crafted features as well as SOTA CNN-based methods, by a large margin, in terms of all four evaluation metrics. Performance gains over the best compared algorithms (ICCV’19 DMRA [52] and CVPR’19 CPFP [74]) are (2.5%–3.5%, 0.7%–3.9%, 0.8%–2.3%, 0.009–0.016) for the metrics (\(S_{\alpha }\), \(max F_{\beta }\), \(max E_{\xi }\), M) on the seven challenging datasets.

Fig. 4.
figure 4

Qualitative visual comparison of the proposed model versus 8 SOTAs.

Visual Comparison. Figure 4 provides sample saliency maps predicted by the proposed method and several SOTA algorithms. Visualizations cover simple scenes (a) and various challenging scenarios, including small objects (b), multiple objects (c), complex backgrounds (d), and low-contrast scenes (e). First, (a) is an easy example. The flower in the foreground is evident in the original RGB image, but the depth image is low-quality and contains some misleading information. The top two models, i.e., DMRA and CPFP, fail to predict the whole extent of the salient object due to the interference from the depth map. Our method can eliminate the side-effects of the depth map by utilizing the complementary depth information more effectively. Second, two examples of small objects are shown in (b). Despite the handle of the teapot in the first row being tiny, our method can accurately detect it. Third, we show two examples with multiple objects in the image in (c). Our method locates all salient objects in the image. It segments the objects better and generates sharper edges compared to other algorithms. Even though the depth map in the first row of (c) lacks clear information, our algorithm predicts the salient objects correctly. Fourth, (d) shows two examples with complex backgrounds. Here, our method produces reliable results, while other algorithms confuse the background as a salient object. Finally, (e) presents two examples in which the contrast between the object and background is low. Many algorithms fail to detect and segment the entire extent of the salient object. Our method produces satisfactory results by suppressing background distractors and exploring the informative cues from the depth map.

Table 2. Performance comparison using different backbones.
Table 3. Comparison of different feature aggregation strategies on seven datasets.

5 Discussion

Scalability. There are three popular backbone architectures (i.e., VGG-16  [57], VGG-19  [57] and ResNet-50  [31]) that are used in deep RGB-D models. To further validate the scalability of the proposed method, we provide performance comparisons using different backbones in Table 2. We find that our BBS-Net exceeds the SOTA methods (e.g., CPFP  [74], and DMRA  [52]) with all of these popular backbones, showing the strong scalability of our framework.

Aggregation Strategies. We conduct several experiments to validate the effectiveness of our cascaded refinement mechanism. Results are shown in Table 3 and Fig. 5(a). ‘Low3’ means that we only integrate the student features (Conv1\(\sim \)3) using the decoder, without the refinement from the initial map, for training and testing. Student features contain abundant details that are beneficial for refining the object edges, but at the same time introduce a lot of background distraction. Integrating only low-level features thus produces unsatisfactory results, generating many distractors (e.g., the \(1^{st}\) and \(2^{nd}\) rows in Fig. 5(a)) or failing to locate the salient objects (e.g., the \(3^{rd}\) row in Fig. 5(a)). ‘High3’ only integrates the teacher features (Conv3\(\sim \)5) using the decoder to predict the saliency map. Compared with student features, teacher features are ‘sophisticated’ and thus contain more semantic information. As a result, they help locate the salient objects and preserve edge information, so integrating teacher features leads to better results. ‘All5’ aggregates features from all five levels (Conv1\(\sim \)5) directly using a single decoder for training and testing. It achieves results comparable to ‘High3’ but may generate background noise introduced by the student features. ‘BBS-NoRF’ indicates that we directly remove the refinement flow of our model, which leads to poor performance. ‘BBS-RH’ can be seen as the reverse of our cascaded refinement mechanism: teacher features (Conv3\(\sim \)5) are first refined by the initial map aggregated from student features (Conv1\(\sim \)3) and are then integrated to generate the final saliency map. It performs worse than our final mechanism (‘BBS-RL’) because, with this reverse refinement strategy, the noise in student features cannot be effectively suppressed. Besides, compared to ‘All5’, our method fully utilizes the features at different levels, and thus achieves a significant performance improvement with fewer background distractors and sharper edges (i.e., ‘BBS-RL’ in Fig. 5(a)).

Table 4. Ablation study of our BBS-Net. ‘BM’ = base model. ‘CA’ = channel attention. ‘SA’ = spatial attention. ‘PTM’ = progressively transposed module.
Table 5. Effectiveness analysis of the cascaded decoder.
Fig. 5.
figure 5

(a): Visual comparison of different aggregation strategies, (b): Visual effectiveness of gradually adding modules. ‘#’ denotes the corresponding row of Table 4.

Impact of Different Modules. As shown in Table 4 and Fig. 5(b), we conduct an ablation study to test the effectiveness of different modules in our BBS-Net. The base model (BM) is our BBS-Net without additional modules (i.e., CA, SA, and PTM). Note that just the BM performs better than the SOTA methods over almost all datasets, as shown in Table 1 and Table 4. Adding the channel attention (CA) and spatial attention (SA) modules enhances performance on most of the datasets. See the results shown in the second and third rows of Table 4. When we combine the two modules (fourth row in Table 4), the performance is greatly improved on all datasets, compared with the BM. We can easily conclude from the ‘#3’ and ‘#4’ columns in Fig. 5(b) that the spatial attention and channel attention mechanisms in DEM allow the model to focus on the informative parts of the depth features, which results in better suppression of background distraction. Finally, we add a progressively transposed block before the second decoder to gradually upsample the feature map to the same resolution as the ground truth. The results in the fifth row of Table 4 and the ‘#5’ column of Fig. 5(b) show that the ‘PTM’ achieves impressive performance gains on all datasets and generates sharper edges with fine details.

To further analyze the effectiveness of the cascaded decoder, we conduct an experiment that replaces the decoder with an element-wise summation mechanism. That is, we first transform the features from different layers to the same dimension using \(1\times 1\) convolutions and upsampling, and then fuse them by element-wise summation. Experimental results in Table 5 demonstrate the effectiveness of the cascaded decoder.
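
This summation baseline of Table 5 can be sketched as follows; the channel width and the choice of the largest resolution as the common size are assumptions on our part.

```python
import torch.nn as nn
import torch.nn.functional as F

class SumFusion(nn.Module):
    """Ablation baseline: 1x1 convs + upsampling + element-wise summation."""
    def __init__(self, in_chs, out_ch=32):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)

    def forward(self, feats):
        size = feats[0].shape[2:]             # resolution of the lowest-level feature (assumed)
        outs = [F.interpolate(r(f), size=size, mode='bilinear', align_corners=False)
                for r, f in zip(self.reduce, feats)]
        return sum(outs)                      # replaces the cascaded decoder's aggregation
```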

Table 6. \(S_{\alpha }\) comparison with SOTA RGB SOD methods on three datasets. ‘w/o D’ and ‘w/ D’ represent training and testing the proposed method without/with the depth.

Benefits of the Depth Map. To explore whether the depth information really contributes to SOD performance, we conduct two experiments in Table 6: (i) We compare the proposed method with five SOTA RGB SOD methods (i.e., CPD [66], PoolNet [44], PiCANet [45], PAGRN [71] and R3Net [14]) when neglecting the depth information. We train and test these methods using the same training and testing sets as our BBS-Net. The proposed method (i.e., Ours (w/ D)) significantly exceeds the SOTA RGB SOD methods due to the usage of depth information. (ii) We train and test the proposed method without using the depth information by setting the inputs of the depth branch to zero (i.e., Ours (w/o D)). Comparing the results of Ours (w/ D) with Ours (w/o D), we find that the depth information effectively improves the performance of the proposed model. Together, these two experiments demonstrate the benefit of depth information for SOD, since depth maps can be seen as prior knowledge that provides spatial-distance information and contour guidance for detecting salient objects.
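
For the ‘Ours (w/o D)’ setting, the depth branch simply receives an all-zero input; a minimal sketch of how such a variant can be run, assuming the model takes (rgb, depth) pairs, is:

```python
import torch

# Assumed interface: zeroing the depth input disables the depth cues at test time.
s1, s2 = model(rgb, torch.zeros_like(depth))
```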

6 Conclusion

We presented a new multi-level multi-modality learning framework that demonstrates state-of-the-art performance on seven challenging RGB-D salient object detection datasets using several evaluation measures. Our BBS-Net is based on a novel bifurcated backbone strategy (BBS) with a cascaded refinement mechanism. Importantly, our simple architecture is backbone independent, making it promising for further research on other related topics, including semantic segmentation, object detection and classification.