Learnable Depth-Sensitive Attention for Deep RGB-D Saliency Detection with Multi-modal Fusion Architecture Search

International Journal of Computer Vision

Abstract

RGB-D salient object detection (SOD) is usually formulated as a classification or regression problem over two modalities, i.e., RGB and depth. Hence, effective RGB-D feature modeling and multi-modal feature fusion both play a vital role in RGB-D SOD. In this paper, we propose a depth-sensitive RGB feature modeling scheme that uses the depth-wise geometric prior of salient objects. The scheme is carried out in a Depth-Sensitive Attention Module (DSAM), which enhances RGB features and reduces background distraction by capturing the depth geometry prior. Furthermore, we extend the original DSAM to DSAMv2 by proposing a novel Depth Attention Generation Module (DAGM) that generates learnable depth attention maps for more robust depth-sensitive RGB feature extraction. Moreover, to perform effective multi-modal feature fusion, we present an automatic neural architecture search approach for RGB-D SOD, which finds a feasible architecture within our specially designed multi-modal multi-scale search space. Extensive experiments on nine standard benchmarks demonstrate the effectiveness of the proposed approach against the state of the art. We name the enhanced learnable Depth-Sensitive Attention and Automatic multi-modal Fusion framework DSA\(^{2}\)Fv2.
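
The abstract does not spell out the internals of DSAM, so the following is only an illustrative sketch of the general idea of depth-guided spatial attention over RGB features, not the authors' module. It assumes that the depth map is split into a few regions by depth quantiles, that each region acts as a binary spatial mask on the RGB feature map, and that the masked features are added back residually. The function names, the quantile binning, and the per-region scalars in `region_weights` (a stand-in for whatever learned processing a real module would apply) are all hypothetical.

```python
import numpy as np


def depth_region_masks(depth, num_regions=3):
    """Split a depth map (H, W) into binary masks (K, H, W) by depth quantiles.

    Quantile binning is an assumption made for this sketch only.
    """
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_regions + 1))
    # Use interior edges only, so every pixel gets an index in [0, K-1].
    idx = np.digitize(depth, edges[1:-1])
    return np.stack([(idx == k).astype(depth.dtype) for k in range(num_regions)])


def depth_guided_attention(rgb_feat, depth, region_weights):
    """Residually enhance RGB features (C, H, W) with depth-region masks.

    region_weights: one hypothetical scalar per depth region.
    """
    masks = depth_region_masks(depth, num_regions=len(region_weights))
    enhanced = rgb_feat.copy()
    for mask, weight in zip(masks, region_weights):
        # Broadcast the (H, W) mask over the channel axis.
        enhanced = enhanced + weight * mask[None, :, :] * rgb_feat
    return enhanced, masks


# Usage on random data: 8 feature channels on a 16x16 grid.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 16, 16))
depth = rng.random((16, 16))
enhanced, masks = depth_guided_attention(rgb_feat, depth, [1.0, 0.5, 0.0])
print(enhanced.shape, masks.shape)
```

Because the masks partition the image, each pixel is scaled by exactly one region weight; a learnable variant in the spirit of DSAMv2 would replace the fixed binary masks with predicted attention maps.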

Acknowledgements

This work is supported in part by the National Key Research and Development Program of China under Grant 2020AAA0107400, the Zhejiang Provincial Natural Science Foundation of China under Grant LR19F020004, and the National Natural Science Foundation of China under Grant U20A20222.

Author information

Corresponding author

Correspondence to Xi Li.

Additional information

Communicated by V. Lepetit.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, P., Zhang, W., Li, S. et al. Learnable Depth-Sensitive Attention for Deep RGB-D Saliency Detection with Multi-modal Fusion Architecture Search. Int J Comput Vis 130, 2822–2841 (2022). https://doi.org/10.1007/s11263-022-01646-0

  • DOI: https://doi.org/10.1007/s11263-022-01646-0
