
1 Introduction

When searching for objects of interest in an image, humans automatically capture the semantic relations between objects and their contexts, pay close attention to prominent objects, and selectively suppress unimportant factors. This precise visual attention mechanism has been explained by various biologically plausible models [1, 2]. Saliency prediction aims to automatically predict the most informative and attractive parts of an image. Rather than processing the whole image, attending to salient objects not only reduces the computational cost but also improves the performance of saliency models in many imaging applications, including quality assessment [3, 4], segmentation [5,6,7], and recognition [8, 9], to name a few. Existing saliency models mainly adopt two strategies to localize salient objects: the bottom-up strategy and the top-down strategy. The bottom-up strategy is driven by external stimuli. By exploiting low-level features such as brightness, color, and orientation, it outputs final saliency maps in an unsupervised manner. However, the bottom-up strategy alone cannot capture sufficient high-level semantic information. Conversely, the top-down strategy is task-driven. It learns high-level features in a supervised manner with labeled ground-truth data, which helps discriminate salient objects from surrounding pixels.

The recent trend is to extract multi-scale, multi-level, high-level features with deep neural networks (DNNs), and a large number of saliency models focusing on RGB images and videos have emerged [10, 14, 36]. Although saliency models for 2D images have achieved remarkable performance, most of them do not transfer to 3D applications. With the recent advent of widely used RGB-D sensing technologies, RGB-D sensors can now capture depth images that provide spatial information complementary to RGB, which both mimics the human visual attention mechanism and improves the performance of saliency models. How to make full use of this cross-modal and cross-level information therefore becomes a challenge. Among the various DNN-based RGB-D saliency models [40, 41], U-shape based structures [42, 43] attract the most attention, since their bottom-up pathways extract low-level feature information while their top-down pathways generate rich and informative high-level features, exploiting the cross-modal complementarity and cross-level continuity of RGB-D information. However, in the top-down pathway of a U-shape network, semantic information is gradually diluted as high-level features are transmitted to shallower stages. We remedy this drawback by designing a global guidance module followed by a series of global guidance flows, which deliver high-level semantic information to feature maps in shallower layers to compensate for the dilution. In addition, we introduce a channel-wise attention module to strengthen the feature representation.

Moreover, we observe that current RGB-D visual saliency models essentially use a symmetric dual-stream encoder (the RGB stream and the depth stream share the same encoder structure). Although a shared encoder structure can improve accuracy, it also imposes a bottleneck on RGB-D saliency prediction: the depth stream does not need the same deep and heavy encoder structure as the RGB stream. Building on this analysis of asymmetric encoder structures, and inspired by the above studies, we design an asymmetric U-shape based network that makes full use of high-level semantic information through a global guidance module. In the refinement process, we build fusion modules, embedded with global guidance flows and channel-wise attention modules, to gradually refine the saliency maps and finally obtain high-quality, fine-grained saliency maps. Overall, the three main contributions of our architecture are as follows:

  1. Instead of using a symmetric encoder structure, we propose an asymmetric encoder structure for RGB-D saliency prediction (VGG-16 for the RGB stream and ResNet-50 for the depth stream), which extracts RGB-D features effectively.

  2. We design a global guidance module at the top of the encoder stream, strongly linked with the top-down stream, to address the dilution of high-level information in U-shape architectures.

  3. We introduce a channel-wise self-attention module in the cross-modal distillation stream to emphasize salient objects and suppress unimportant surroundings, thus improving the feature representation.

2 Related Work

Recently, DNNs have attracted increasing attention for their ability to extract significant features, and numerous DNN-based models have been proposed for 2D saliency prediction. For instance, Yang et al. [11] proposed a salient object prediction network for RGB images by using parallel dilated convolutions of different sizes and a set of loss functions to optimize the network. Cordel et al. [12] proposed a DNN-based model embedded with an improved evaluation metric to address the problem of measuring saliency prediction for 2D images. Cornia et al. [13] proposed an RGB saliency attentive model combining dilated convolutions, attention models, and learned prior maps. Wang et al. [14] trained an end-to-end architecture in a deeply supervised manner to obtain final saliency maps of RGB images. Liu et al. [15] aimed to improve saliency prediction in three aspects and proposed a saliency prediction model that captures multiple contexts. Liu et al. [16] proposed a computational network for predicting human eye fixations in RGB images. Although these methods achieve great success, they do not generalize to 3D scenarios.

To overcome this problem, Zhang et al. [17] proposed a deep-learning-feature-based RGB-D visual saliency detection model. Yang et al. [18] proposed a two-stage clustering-based RGB-D visual saliency model for human visual fixation prediction in dynamic scenarios. Nguyen et al. [19] investigated a Deep Visual Saliency (DeepVS) model to achieve a more accurate and reliable saliency predictor even in the presence of distortions. Liu et al. [20] first proposed a cluster-contrast saliency prediction model for depth maps and then obtained human fixation predictions from the centroid of the largest cluster of each depth super-pixel. Sun et al. [21] proposed a real-time, end-to-end video saliency prediction model based on a 3D residual convolutional neural network (3D-ResNet). Although these works emphasize the importance of auxiliary spatial information, there is still considerable room for improvement. To fully utilize cross-modal features, we adopt a U-shape based architecture in light of its cross-modal and cross-level ability.

Considering that different spatial and channel components of features yield different saliency responses, some works introduce attention mechanisms into saliency models to enhance the discrimination between salient regions and background. Liu et al. [22] built on previous studies and went further, proposing an attention-guided RGB-D salient object network whose attention model combines spatial and channel attention mechanisms and provides a better feature representation. Noori et al. [10] employed a multi-scale attention-guided module in an RGB-D saliency model to pay more attention to salient regions. Li et al. [23] developed an RGB-D salient object model embedded with a cross-modal attention module to enhance salient objects. Jiang et al. [24] exploited an attention model in an RGB-D tracking framework to assign larger weights to the salient features of the two modalities, thus obtaining more refined salient features. Different from the above attention mechanisms, our attention module focuses on channel features and models the interdependencies of high-level features. The proposed attention mechanism can be embedded seamlessly into any network and improves the performance of saliency prediction models.

3 The Proposed Method

3.1 Overall Architecture

The proposed RGB-D saliency model comprises five primary parts built on the U-shape architecture: the encoder structures, the cross-modal distillation streams, the global guidance module, the channel attention module, and the fusion modules for refinement. Figure 1 shows the overall framework of the proposed model. Concretely, we adopt the VGG-16 network [27] as the backbone of the encoder for RGB information and the ResNet-50 network [28] as the backbone of the encoder for depth information. To meet the needs of saliency prediction, we retain the five basic convolution blocks of the VGG-16 and ResNet-50 networks and remove their last pooling layers and fully connected layers. We build the global guidance module inspired by Atrous Spatial Pyramid Pooling [29]. More specifically, the global guidance module is placed at the top of the RGB and depth bottom-up streams to capture high-level semantic information. The refinement stream mainly consists of fusion modules, which gradually refine the coarse saliency maps into high-quality predicted saliency maps.

Fig. 1.

The overall architecture of the proposed saliency prediction model: the dotted box represents the fusion processing in the refinement stream, which is explained in Sect. 3.5. (Colour figure online)

3.2 Hierarchical RGB-D Feature Extraction

The main function of the RGB and depth streams is to extract multi-scale, multi-level RGB-D features at different levels of abstraction. In the proposed U-shape based architecture, we use an ImageNet [30] pre-trained VGG-16 network and a pre-trained ResNet-50 network as the backbones of the bottom-up streams for RGB and depth images, respectively. Since depth information cannot be fed into the backbone network directly, we transform each depth image into a three-channel HHA image [31]. We resize the input RGB image and its paired depth image of resolution W × H (W the width, H the height) to 288 × 288 and then feed them into the backbone networks. Considering that the RGB and depth feature maps in the backbones can both be denoted as \( \left( \frac{W}{2^{n-1}}, \frac{H}{2^{n-1}} \right) \), n = 1, 2, 3, 4, 5, we use the RGB and depth feature maps to learn the multi-scale and multi-level feature maps (FM).
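To make the bottom-up streams concrete, the following PyTorch sketch shows one way to realize the asymmetric encoders with the torchvision VGG-16 and ResNet-50 implementations. The block boundaries and the use of torchvision models are our assumptions for illustration, and matching the W/2^(n-1) feature-map sizes stated above may additionally require stride adjustments not detailed in the text.

```python
import torch
import torchvision.models as models

class AsymmetricEncoder(torch.nn.Module):
    """Illustrative sketch of the asymmetric bottom-up streams: VGG-16 for the
    RGB image and ResNet-50 for the HHA-encoded depth image. Block boundaries
    follow torchvision's layouts and may differ from the authors' exact split."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features          # classifier and last pooling unused
        self.rgb_blocks = torch.nn.ModuleList([                # five VGG-16 convolution blocks
            vgg[:4], vgg[4:9], vgg[9:16], vgg[16:23], vgg[23:30]
        ])
        resnet = models.resnet50(pretrained=True)              # avgpool and fc unused
        self.dep_blocks = torch.nn.ModuleList([                # five ResNet-50 stages
            torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu),
            torch.nn.Sequential(resnet.maxpool, resnet.layer1),
            resnet.layer2, resnet.layer3, resnet.layer4
        ])

    def forward(self, rgb, hha):                               # both inputs resized to 288 x 288
        rgb_feats, dep_feats = [], []
        for blk in self.rgb_blocks:
            rgb = blk(rgb)
            rgb_feats.append(rgb)                              # multi-scale RGB feature maps
        for blk in self.dep_blocks:
            hha = blk(hha)
            dep_feats.append(hha)                              # multi-scale depth feature maps
        return rgb_feats, dep_feats
```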

3.3 Global Guidance Module

We adopt a U-shape based architecture because of its ability to build rich feature information through top-down flows. However, in a U-shape network the high-level features are gradually diluted as they are transmitted to shallower layers. Inspired by works that introduce dilated convolutions into saliency prediction networks to capture multi-scale high-level semantic information in the top layers [32, 33], we provide an individual module with a set of global guidance flows (GFs) (shown in Fig. 1 as a series of green arrows) to explicitly make the feature maps in shallower layers aware of the locations of the salient targets. More specifically, the global guidance module (GM) is built upon the top of the bottom-up pathways. The GM consists of four branches that capture the context information of the high-level feature maps (GFM). The first branch uses a standard convolution with a 1 × 1 kernel, and the other branches use dilated convolutions with dilation rates of 2, 6, and 12 and kernel sizes of 2 × 2, 6 × 6, and 12 × 12, respectively. All strides are set to 1. To address the dilution issue, we introduce a set of GFs, which deliver the GFM to, and merge it with, the feature maps in shallower layers. In this way, we effectively supplement the lower layers with high-level feature information during transmission and prevent the salient information from being diluted when refining the saliency maps.
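A minimal sketch of such a four-branch module is given below, assuming each branch maps the top-level features to a common channel width and that the four branch outputs are fused by a 1 × 1 convolution; the channel widths and the fusion step are our assumptions, since the text only specifies the branch kernels, dilation rates, and strides.

```python
import torch
import torch.nn as nn

class GlobalGuidanceModule(nn.Module):
    """Sketch of the global guidance module (GM). The (kernel, dilation) pairs
    follow the text; the channel widths and the 1x1 fusion convolution are
    illustrative assumptions."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList()
        for k, d in [(1, 1), (2, 2), (6, 6), (12, 12)]:
            pad = d * (k - 1) // 2                      # keeps the spatial resolution
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1,
                          dilation=d, padding=pad),
                nn.ReLU(inplace=True)))
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):                               # x: top-level encoder features
        gfm = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(gfm)                           # global guidance features (GFM)
```

The resulting GFM is what the GFs up-sample and merge with the shallower feature maps during refinement.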

3.4 Channel Attention Module

In the top-down streams we introduce an attention mechanism to enhance the feature representation, which is beneficial for discriminating salient details from the background. The proposed channel-wise attention module (CWAM) is illustrated in Fig. 2.

Fig. 2.

Illustration of the proposed channel-wise attention module: the symbol ⊗ denotes matrix multiplication, ⊕ denotes the addition operation, and ⊝ denotes the subtraction operation.

Specifically, from the original hierarchical feature maps FM ∈ ℝC × H × W, we obtain the channel attention map AM ∈ ℝC × N by reshaping the H × W dimensions into N = H × W. We then apply a matrix multiplication between AM and its transpose, the result of which is denoted RA. Next, to distinguish salient targets from the background, we use the Max() function to take the maximum between each value and −1, and subtract RA from this maximum. We then apply the Mean() function to obtain an average result, denoted AB, which suppresses useless targets and emphasizes significant pixels. The operation is shown in Eq. (1).

$$ A_{B} = \frac{1}{C}\sum\nolimits_{i = 1}^{C} \left\{ Max\left[ \left( R_{A} \otimes \left( R_{A} \right)^{T} \right), -1 \right] - R_{A} \right\} $$
(1)

where \( \otimes \) denotes the matrix multiplication operation. AB is then fed into a Softmax layer to obtain the channel-wise attention maps. Finally, we apply another matrix multiplication and reshape the result back to the original dimensions. The final output is formulated in Eq. (2).

$$ A_{F} = R\left( A_{B} \otimes R_{A} \right) \oplus \left( \theta \times F_{M} \right) $$
(2)

where R(·) denotes the reshape back to the original dimensions ℝC × H × W, \( \oplus \) denotes the addition operation, and θ is a scale parameter learned gradually from zero.
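The following sketch gives one plausible implementation of the CWAM. Since the notation in Eqs. (1)–(2) is compact, we read Max[·, −1] as a maximum taken along the last dimension of the channel affinity matrix, which is a common choice in channel self-attention designs; this reading and the placement of the Softmax are assumptions on our part, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Illustrative sketch of the channel-wise attention module (CWAM),
    following one plausible reading of Eqs. (1)-(2)."""

    def __init__(self):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(1))   # scale parameter learnt from zero
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, fm):                           # fm: (B, C, H, W)
        b, c, h, w = fm.size()
        am = fm.view(b, c, -1)                       # reshape H x W into N: (B, C, N)
        ra = torch.bmm(am, am.transpose(1, 2))       # channel affinity matrix: (B, C, C)
        # subtract each entry from the row maximum to sharpen channel contrast (assumed reading)
        ab = ra.max(dim=-1, keepdim=True)[0].expand_as(ra) - ra
        ab = self.softmax(ab)                        # channel-wise attention weights
        out = torch.bmm(ab, am).view(b, c, h, w)     # R(A_B ⊗ reshaped features)
        return out + self.theta * fm                 # Eq. (2): add theta-scaled input features
```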

3.5 Refinement

We improve the quality of the feature maps in the refinement stream. Although the attention mechanism reinforces the feature representation, it alone still yields unsatisfactory saliency structures. To obtain subtle and accurate saliency predictions, we employ integrated fusion modules (IFMs) in the refinement stream. Specifically, four IFMs linked with four CWAMs make up one top-down stream, and the two top-down streams, combined with the global features GFM delivered by the GFs, comprise the refinement stream. Figure 3 illustrates the details of the refinement stream.

Fig. 3.

Illustration of the details of the refinement stream: as shown in the dotted box in Fig. 1, at the end of our network the global guidance flow (the thick green arrow) joins the IFMs of RGB information (right dotted boxes) and of depth information (left dotted boxes); the two IFMs are concatenated with the high-level global features and then enlarged with an up-sampling layer to obtain the final saliency maps. (Colour figure online)

For each IFMm (m = 1, 2, 3, …, 8), its input contains three parts:

  i) FM from the RGB encoder stream (shown as a red cuboid in Fig. 3) or from the depth encoder stream (shown as a pale blue cuboid in Fig. 3).

  ii) AF output by the CWAM of the corresponding top-down stream, followed by standard convolution layers, the nonlinear activation function ReLU, batch normalization layers, and up-sampling layers.

  iii) GFM, followed by standard convolution layers, the nonlinear activation function ReLU, batch normalization layers, and up-sampling layers.

We set the size of all convolution kernels to 3 × 3 and the stride to 1. The output of IFM is formulated in Eq. (3).

$$ O\left( IFM_{m} \right) = \sum\nolimits_{i = 1}^{2} \lambda \left( \gamma \left( u_{i}\left( c_{i}\left( A_{F} , G_{FM} \right) \right) \right) \right) \odot F_{M} $$
(3)

where γ(·) denotes the ReLU function and λ(·) the batch normalization layer; ci and ui (i = 1, 2) represent the standard convolution layers and up-sampling layers applied to AF and GFM, respectively; and \( \odot \) denotes the concatenation operation.
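For illustration, the sketch below implements one IFM under a straightforward reading of the equation above: AF and GFM are each passed through a 3 × 3 convolution, up-sampling, ReLU, and batch normalization, the two branch outputs are summed over i = 1, 2, and the result is concatenated with the encoder feature map FM. The channel widths and the ×2 up-sampling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntegratedFusionModule(nn.Module):
    """Sketch of one integrated fusion module (IFM), following one reading of
    the IFM output equation; channel widths and the up-sampling factor are
    illustrative assumptions."""

    def __init__(self, af_ch, gfm_ch, out_ch, scale=2):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(out_ch))
        self.af_branch = branch(af_ch)        # processes the attention output A_F
        self.gfm_branch = branch(gfm_ch)      # processes the global guidance features G_FM

    def forward(self, a_f, g_fm, f_m):
        # both branches are assumed to be up-sampled to the resolution of F_M
        fused = self.af_branch(a_f) + self.gfm_branch(g_fm)   # sum over i = 1, 2
        return torch.cat([fused, f_m], dim=1)                 # concatenate with F_M
```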

Finally, we combine two metrics, the MSE and a modified Pearson's linear correlation coefficient (CC), as the loss function to compare the final predicted saliency map with the ground truth. Using σ(·) for the covariance and standard deviation terms, the loss function is shown in Eq. (4).

$$ LOSS = \frac{1}{M}\sum\nolimits_{m = 1}^{M} \left\| T - \hat{T} \right\|^{2} + 1 - \frac{\sigma \left( T, \hat{T} \right)}{\sigma \left( T \right) \times \sigma \left( \hat{T} \right)} $$
(4)

where T denotes the ground-truth value and \( \hat{T} \) the predicted value; σ(T, \( \hat{T} \)) denotes their covariance, and σ(T) and σ(\( \hat{T} \)) denote their standard deviations.
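A compact sketch of this loss is given below, assuming the ground truth T and prediction \( \hat{T} \) are saliency maps of shape (B, 1, H, W); the small epsilon guarding against division by zero is our addition.

```python
import torch

def saliency_loss(pred, target, eps=1e-8):
    """Combined loss: MSE plus (1 - CC), following the MSE + (1 - CC) loss above."""
    mse = torch.mean((pred - target) ** 2)
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)    # centred prediction
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)  # centred ground truth
    cc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)    # Pearson CC per sample
    return mse + (1.0 - cc.mean())
```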

4 Experiments

4.1 Implementation Details

All experiments were implemented in the PyTorch 1.1.0 framework [34]. Training and testing were run on a TITAN Xp GPU with 8 GB of memory. The parameters of our backbones, the VGG-16 and ResNet-50 networks, were initialized with weights pre-trained on the ImageNet dataset. We trained the network end-to-end with 288 × 288 input images, a batch size of one image per iteration, and an initial learning rate of 10^-5, for 70 epochs in total.
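Under these settings, a training loop might look like the sketch below. The optimizer choice (Adam), the SaliencyNet module, and the train_loader are hypothetical placeholders, since the text specifies only the input resolution, batch size, learning rate, and number of epochs; saliency_loss is the sketch from Sect. 3.5.

```python
import torch

model = SaliencyNet().cuda()                        # hypothetical top-level network module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # optimizer choice is an assumption

for epoch in range(70):
    for rgb, hha, fixation in train_loader:         # hypothetical loader of 288 x 288 inputs, batch size 1
        rgb, hha, fixation = rgb.cuda(), hha.cuda(), fixation.cuda()
        pred = model(rgb, hha)
        loss = saliency_loss(pred, fixation)        # MSE + (1 - CC), see Sect. 3.5
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```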

4.2 Datasets

We conduct experiments on two popular public saliency prediction datasets, the NUS dataset [25] and the NCTU dataset [26], to evaluate the performance of the proposed network. The NUS dataset comprises 600 images, including 3D images with corresponding color stimuli, fixation maps, smoothed depth maps, and paired depth images. The NCTU dataset comprises 475 images at 1920 × 1080 resolution, including 3D images with corresponding left- and right-view maps, fixation maps, disparity maps, and paired depth images.

4.3 Evaluation Criteria

To evaluate the performance of our approach, we use four widely used evaluation metrics: Pearson's Linear Correlation Coefficient (CC), Area Under Curve (AUC), Normalized Scanpath Saliency (NSS), and Kullback-Leibler Divergence (KL-Div).

CC is a statistical index that reflects the linear correlation between our model output (T) and the ground-truth human fixations (G). The larger the CC value, the more correlated the two variables are. Equation (5) gives the calculation of CC.

$$ CC = \frac{{cov\left( {T,G} \right)}}{{\sqrt {cov\left( T \right)} \sqrt {cov\left( G \right)} }} $$
(5)

where cov(T, G) denotes the covariance between the final output T and the ground truth G, and cov(T) and cov(G) denote their variances.

AUC is defined as the area under the ROC curve. We use AUC as an evaluation criterion since a single ROC curve cannot adequately assess model performance. The larger the AUC value, the better the model performs. Equation (6) gives the formulation of AUC.

$$ AUC = \frac{{\mathop \sum \nolimits_{pos} k - \frac{{NUM_{pos} \left( {NUM_{pos} + 1} \right)}}{2}}}{{NUM_{pos} NUM_{neg} }} $$
(6)

where the term Σpos k in the numerator sums the ranks k of the positive instances and depends only on the positive instances, and NUMpos and NUMneg denote the numbers of positive and negative instances, respectively.

NSS evaluates the average of the normalized saliency values at the M human fixation locations. The larger the NSS value, the better the model performs. Using the mean μT and standard deviation σT of the predicted map T, NSS is calculated with Eq. (7), where (x_G^i, y_G^i) denotes the i-th fixation location.

$$ NSS = \frac{1}{M}\sum\nolimits_{i = 1}^{M} {\frac{{T\left( {x_{G}^{i} , y_{G}^{i} } \right) - \mu_{T} }}{{\sigma_{T} }}} $$
(7)

KLDiv measures the divergence between two distributions and is widely used for saliency model evaluation. The smaller the KLDiv value, the better the saliency model performs. Given two probability distributions of x, denoted c(x) and d(x), KLDiv is calculated with Eq. (8).

$$ KLDiv = \sum\nolimits_{i = 1}^{n} {c\left( x \right)log\frac{c\left( x \right)}{d\left( x \right)}} $$
(8)
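For reference, hedged PyTorch sketches of CC, NSS, and KLDiv consistent with Eqs. (5), (7), and (8) are given below (AUC is omitted since it is a ranking-based measure). Treating c(x) as the ground-truth distribution and d(x) as the prediction is our assumption, as is the small epsilon added for numerical stability.

```python
import torch

def cc_metric(pred, gt, eps=1e-8):
    """Pearson's linear correlation coefficient between two saliency maps (Eq. 5)."""
    p, g = pred.flatten() - pred.mean(), gt.flatten() - gt.mean()
    return float((p * g).sum() / (p.norm() * g.norm() + eps))

def nss_metric(pred, fixations, eps=1e-8):
    """Normalized scanpath saliency (Eq. 7): mean of the normalized map at fixation points.
    `fixations` is a binary map with ones at human fixation locations."""
    norm_map = (pred - pred.mean()) / (pred.std() + eps)
    return float(norm_map[fixations > 0].mean())

def kldiv_metric(pred, gt, eps=1e-8):
    """Kullback-Leibler divergence (Eq. 8), with c(x) taken as the ground-truth
    distribution and d(x) as the prediction (an assumed convention)."""
    c = gt.flatten() / (gt.sum() + eps)          # ground-truth distribution c(x)
    d = pred.flatten() / (pred.sum() + eps)      # predicted distribution d(x)
    return float((c * torch.log(c / (d + eps) + eps)).sum())
```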

4.4 Ablation Studies and Analysis

To thoroughly investigate the effectiveness of the different components of the proposed network, we conduct a series of ablation studies on the NCTU and NUS datasets. The comparison results are reported in Table 1, where the backbone without GM and CWAMs is denoted B. Figure 4 shows a visual comparison of different model components on the NCTU dataset. To explore the effectiveness of GM, we remove it from the network; this variant is denoted B + A in Table 1, and the comparison shows that GM contributes to the proposed network. To verify the positive effect of the CWAMs, we remove them from the proposed network; this variant is denoted B + G in Table 1.

Table 1. Ablation studies of different components
Fig. 4.

Visual comparison of ablation studies.

In Fig. 4, we can see clearly that B trained with GM (column 3) learns more elaborate salient regions, and B trained with CWAMs (column 4) enhances prominent regions and learns more certain, less blurry salient objects. Thus, by adopting both modules, our full model (column 5) generates more accurate saliency maps that are much closer to the ground truth (column 6) than the other variants.

4.5 Comparison with Other Saliency Prediction Models

We compare our proposed model on the two benchmark datasets against six other state-of-the-art models, namely Qi's [39], Fang's [35], DeepFix [36], ML-Net [37], DVA [14], and iSEEL [38]. Note that all saliency maps of the above models are obtained by running their source code with the recommended parameters. For a fair comparison, we trained all models on the NCTU and NUS datasets. Table 2 presents the quantitative results of the different models, and Fig. 5 shows a visual comparison between the six models and our proposed model.

Table 2. CC, KLDiv, AUC and NSS comparisons of different models
Fig. 5.

Visual comparison between the six selected models and our proposed model: column 1 shows the original left-view images, column 2 the ground truth, column 3 our proposed method, and columns 4 to 9 the saliency prediction models from [14, 35,36,37,38,39].

For stimuli-driven scenes, our model performs effectively whether the discrimination between targets and background is explicit (rows 1 and 3) or implicit (rows 4 and 7). For task-driven scenes, our model predicts faces (rows 2, 5, and 8) and people against complex backgrounds (row 6). For scenes influenced by lighting (rows 9 and 10), our attention mechanism locates the salient objects appropriately. Overall, our method is capable of ignoring distracting background and highlighting salient objects in various scenes.

5 Conclusion

In this paper, we proposed an asymmetric attention-based network. Concretely, the bottom-up streams capture multi-scale and multi-level features of RGB images and their paired depth images. In the top-down streams for cross-modal features, we incorporate global guidance information and features from parallel layers and introduce a channel-wise attention module to enhance the salient feature representation. Experimental results show that our model outperforms six state-of-the-art models. In future work, we expect the proposed model to be applied to other 3D scenarios, including video object detection and object tracking.