Fusion hierarchy motion feature for video saliency detection

Xiao, Fen; Luo, Huiyu; Zhang, Wenlei; Li, Zhen; Gao, Xieping

doi:10.1007/s11042-023-16593-2


Published: 20 September 2023

Volume 83, pages 32301–32320, (2024)

Multimedia Tools and Applications

Fen Xiao ORCID: orcid.org/0000-0001-7511-9418¹,
Huiyu Luo¹,
Wenlei Zhang¹,
Zhen Li¹ &
…
Xieping Gao²


Abstract

Saliency detection plays an important role in computer vision and scene understanding and has attracted increasing attention in recent years. Compared with the widely studied problem of image saliency prediction, video saliency still poses many open problems. Unlike images, video data contain motion information, and describing and exploiting that information effectively is a critical issue. In this paper, we propose a spatial and motion dual-stream framework for video saliency detection. The coarse motion features extracted from optical flow are fine-tuned with higher-level semantic spatial features via a residual cross-connection. A hierarchical fusion structure maintains contextual information by integrating spatial and motion features at each level. To model inter-frame correlation in the video, a convolutional gated recurrent unit (convGRU) is used to keep the salient region globally consistent between neighboring frames. Experimental results on four widely used datasets demonstrate the effectiveness of the proposed method in comparison with other state-of-the-art methods. Our source code is available at https://github.com/banhuML/MFHF.
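To make the recurrent step mentioned in the abstract concrete, the following is a minimal single-channel sketch of one convGRU update in plain Python. It is not the authors' implementation (see the repository above for that): it assumes the commonly used convGRU formulation in which the matrix products of a standard GRU are replaced by convolutions, and the names `conv2d_same`, `conv_gru_step`, and the kernel keys `Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh` are hypothetical and chosen only for illustration. Biases and multiple channels are omitted for brevity.

```python
import math


def conv2d_same(image, kernel):
    """Single-channel cross-correlation with zero padding (output size = input size)."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2  # assumes a square kernel with odd side length
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di + k][dj + k]
            out[i][j] = acc
    return out


def sigmoid(v):
    # Written via tanh so that large |v| cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * v))


def combine(fn, *maps):
    """Apply fn element-wise across 2D maps of identical size."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*maps)]


def conv_gru_step(x, h_prev, params):
    """One convGRU update: returns the new hidden map h_t (assumed formulation)."""
    # Update gate z_t = sigmoid(Wz * x_t + Uz * h_{t-1})
    z = combine(lambda a, b: sigmoid(a + b),
                conv2d_same(x, params["Wz"]), conv2d_same(h_prev, params["Uz"]))
    # Reset gate r_t = sigmoid(Wr * x_t + Ur * h_{t-1})
    r = combine(lambda a, b: sigmoid(a + b),
                conv2d_same(x, params["Wr"]), conv2d_same(h_prev, params["Ur"]))
    # Candidate state = tanh(Wh * x_t + Uh * (r_t . h_{t-1}))
    rh = combine(lambda a, b: a * b, r, h_prev)
    cand = combine(lambda a, b: math.tanh(a + b),
                   conv2d_same(x, params["Wh"]), conv2d_same(rh, params["Uh"]))
    # h_t = (1 - z_t) . h_{t-1} + z_t . candidate
    return combine(lambda zi, hp, c: (1.0 - zi) * hp + zi * c, z, h_prev, cand)


# Illustrative run on two tiny "frames" with arbitrary (untrained) kernels.
box = [[0.1] * 3 for _ in range(3)]
demo_params = {name: box for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
frames = [[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
          [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
hidden = [[0.0] * 3 for _ in range(2)]
for frame in frames:
    hidden = conv_gru_step(frame, hidden, demo_params)
```

Because the hidden map is carried from frame to frame and blended through the update gate, the state at each pixel changes smoothly over time, which is the mechanism the abstract relies on for consistency between neighboring frames. In a real model the inputs would be multi-channel fused feature maps and the kernels would be learned.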


Data Availability

The datasets generated and/or analysed during the current study are available in the DHF1K repository, https://github.com/wenguanwang/DHF1K.


Acknowledgements

This research was supported by the National Science and Technology Major Project (Grant No. 2020YFA0713504), the National Natural Science Foundation of China (Nos. 62376238, 62372170), and the Scientific Research Foundation of Education Department of Hunan Province of China (Grant No. 21A0109).

Author information

Authors and Affiliations

The MOE Key Laboratory of Intelligent Computing and Information Processing, Xiangtan University, Xiangtan, Hunan, China
Fen Xiao, Huiyu Luo, Wenlei Zhang & Zhen Li
Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, Hunan, China
Xieping Gao

Corresponding author

Correspondence to Xieping Gao.

Ethics declarations

Conflicts of interest

The authors declare that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xiao, F., Luo, H., Zhang, W. et al. Fusion hierarchy motion feature for video saliency detection. Multimed Tools Appl 83, 32301–32320 (2024). https://doi.org/10.1007/s11042-023-16593-2

Received: 11 July 2022
Revised: 21 June 2023
Accepted: 21 August 2023
Published: 20 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16593-2
