
Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection


Abstract

The human visual system relies on an attention mechanism and on features at multiple levels to extract accurate visual saliency information, so multilevel features are important for saliency detection. Building on the numerous biological frameworks for visual information processing, we find that better combining and exploiting multilevel features together with temporal information can greatly improve the accuracy of a video saliency model. We therefore propose TSFP-Net, a temporal-spatial feature pyramid network for video saliency detection. Its encoder extracts multiscale temporal-spatial features from consecutive input video frames and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. Its decoder hierarchically decodes the temporal-spatial features at different scales and finally produces a saliency map by integrating information from multiple video frames. The model is simple yet effective and runs in real time. Abundant experiments show that this well-designed structure significantly improves the precision of video saliency detection: on three purely visual video saliency benchmarks, our method outperforms the existing state-of-the-art methods while offering much higher prediction precision, a simple structure, the second smallest model size, and the third fastest running time among them.
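
To make the encoder-decoder design described above concrete, the following PyTorch sketch illustrates one way such a temporal-spatial feature pyramid could be wired up. It is a minimal, assumption-laden reconstruction, not the authors' TSFP-Net: the three strided 3D-convolution stages stand in for the pretrained 3D backbone, and the channel widths, the 16-frame clip length, and all module names are hypothetical.

```python
# Illustrative sketch of a temporal-spatial feature pyramid for video saliency.
# NOT the authors' released TSFP-Net: the strided-Conv3d encoder stands in for a
# pretrained 3D backbone; channel widths, clip length, and module names are
# assumptions made for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSpatialFPN(nn.Module):
    def __init__(self, in_ch=3, dims=(64, 128, 256), fpn_dim=128, clip_len=16):
        super().__init__()
        # Encoder: each stage halves the spatial resolution, keeping the time axis.
        stages, c_prev = [], in_ch
        for c in dims:
            stages.append(nn.Sequential(
                nn.Conv3d(c_prev, c, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True)))
            c_prev = c
        self.stages = nn.ModuleList(stages)
        # Lateral 1x1x1 convs project every scale to a common pyramid width.
        self.lateral = nn.ModuleList([nn.Conv3d(c, fpn_dim, 1) for c in dims])
        # Decoder: one smoothing conv per pyramid level before fusion.
        self.smooth = nn.ModuleList(
            [nn.Conv3d(fpn_dim, fpn_dim, 3, padding=1) for _ in dims])
        # Readout collapses the temporal axis and channels into one saliency map.
        self.readout = nn.Conv3d(fpn_dim, 1, kernel_size=(clip_len, 1, 1))

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        feats, x = [], clip
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Top-down integration: upsample the coarser level, add the lateral feature.
        pyramid = [self.lateral[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(pyramid[0], size=feats[i].shape[2:],
                               mode='trilinear', align_corners=False)
            pyramid.insert(0, self.lateral[i](feats[i]) + up)
        # Hierarchical decoding: smooth each level, resize to the finest scale, sum.
        fused = 0
        for level, smooth in zip(pyramid, self.smooth):
            fused = fused + F.interpolate(smooth(level), size=pyramid[0].shape[2:],
                                          mode='trilinear', align_corners=False)
        sal = self.readout(fused).squeeze(2)          # (B, 1, H/2, W/2)
        sal = F.interpolate(sal, size=clip.shape[-2:], mode='bilinear',
                            align_corners=False)
        return torch.sigmoid(sal)                     # per-pixel saliency in [0, 1]


if __name__ == "__main__":
    model = TemporalSpatialFPN()
    clip = torch.randn(1, 3, 16, 128, 192)            # one 16-frame RGB clip
    print(model(clip).shape)                          # torch.Size([1, 1, 128, 192])
```

The top-down additions mirror a feature pyramid network extended to 3D: each coarser level is upsampled and merged with the lateral projection of the next finer level, and the decoder then fuses all levels before collapsing the temporal axis into a single saliency map.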


Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.


Funding

This study was funded by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61375025, 61075011, and 60675018, and by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China.

Author information

Corresponding author: Shiping Zhu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chang, Q., Zhu, S. Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection. Cogn Comput 15, 856–868 (2023). https://doi.org/10.1007/s12559-023-10114-x
