Spatiotemporal context-aware network for video salient object detection

  • Original Article
Neural Computing and Applications

Abstract

Interest in video salient object detection (VSOD) is growing in the computer vision community. Unlike image salient object detection (ISOD), VSOD requires motion cues in addition to appearance information, so exploiting spatiotemporal information is essential for generating accurate saliency results. Existing VSOD models mainly combine an ISOD model with long short-term memory (LSTM) or flow-estimation modules to integrate saliency cues estimated from the spatial and temporal domains. However, flow-estimation modules rely heavily on optical flow images, whose generation is time-consuming and severely limits their practical application. Moreover, an LSTM can exploit motion cues only through step-by-step propagation in the time domain, which makes multi-scale spatiotemporal interaction hard to realize. In this paper, we propose SCANet to address these problems. Specifically, we develop the pyramid dilated 3D convolutional (PD3C) module to generate rich temporal features by leveraging context information, and we design a feature aggregation module to effectively integrate spatial and temporal features. Equipped with these modules, SCANet generates high-quality saliency maps at faster than real-time inference speed (41 FPS on a single Titan Xp GPU). Extensive experiments on six widely used benchmark datasets show that SCANet outperforms state-of-the-art methods on three standard evaluation metrics. Our code will be publicly available at https://github.com/clelouch/SCANet.
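
The idea behind a pyramid of dilated convolutions along the time axis can be illustrated with a minimal pure-Python sketch. This is not the authors' PD3C implementation: it works on a 1D sequence of per-frame values rather than 3D feature volumes, and the function names, the dilation rates (1, 2, 4), and the averaging fusion are all assumptions made for illustration. It only shows how parallel branches with different dilation rates let a fixed-size kernel gather temporal context at several scales in one step, in contrast to step-by-step propagation.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Temporal span covered by a kernel of `kernel_size` taps at a given dilation."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_temporal_conv(
    signal: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """Dilated 1D cross-correlation along time with zero padding (same-length output)."""
    center = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t + (i - center) * dilation  # taps are spaced `dilation` frames apart
            if 0 <= idx < len(signal):         # frames outside the clip count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def pyramid_dilated_temporal(
    signal: Sequence[float],
    kernel: Sequence[float],
    dilations: Sequence[int] = (1, 2, 4),  # hypothetical rates, not from the paper
) -> List[float]:
    """Run parallel dilated branches and fuse them by averaging (an assumed fusion rule)."""
    branches = [dilated_temporal_conv(signal, kernel, d) for d in dilations]
    return [sum(values) / len(branches) for values in zip(*branches)]


if __name__ == "__main__":
    # A single "event" at frame 4 of a 9-frame clip.
    impulse = [0.0] * 9
    impulse[4] = 1.0
    for d in (1, 2, 4):
        print(d, effective_kernel_size(3, d), dilated_temporal_conv(impulse, [1.0, 1.0, 1.0], d))
    print(pyramid_dilated_temporal(impulse, [1.0, 1.0, 1.0]))
```

With a 3-tap kernel, the three assumed branches cover 3, 5, and 9 frames respectively, so the fused response at each frame mixes short-range and long-range temporal context.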

Acknowledgements

This work was supported by the National Natural Science Foundation of China (under Grant 51807003).

Funding

This work was supported by the Young Scientists Fund (Grant No. 51807003).

Author information

Corresponding author

Correspondence to Jin Xiao.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in, or the review of, this manuscript.

About this article

Cite this article

Chen, T., Xiao, J., Hu, X. et al. Spatiotemporal context-aware network for video salient object detection. Neural Comput & Applic 34, 16861–16877 (2022). https://doi.org/10.1007/s00521-022-07330-1

  • DOI: https://doi.org/10.1007/s00521-022-07330-1
