Abstract
The human visual attention mechanism relies on multilevel features to extract accurate visual saliency information, so multilevel features are important for saliency detection. Drawing on the numerous biological frameworks for visual information processing, we find that better combining and exploiting multilevel features together with temporal information can greatly improve the accuracy of a video saliency model. We therefore propose TSFP-Net, a temporal-spatial feature pyramid network. Its encoder extracts multiscale temporal-spatial features from consecutive input video frames and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder hierarchically decodes the temporal-spatial features at different scales and finally produces a saliency map by integrating information across multiple video frames. The model is simple yet effective and runs in real time. Compared with state-of-the-art methods, TSFP-Net achieves much higher prediction precision while having a simple structure, the second smallest model size, and the third fastest running time. Abundant experiments show that the well-designed structure significantly improves the precision of video saliency detection, and results on three purely visual video saliency benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
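To make the encoder-decoder pipeline described above concrete, the following is a minimal PyTorch sketch of a temporal-spatial feature pyramid. It is not the authors' TSFP-Net implementation: the backbone, channel widths, strides, the trilinear top-down integration, and the temporal mean-pooling in the decoder are all illustrative assumptions. Only the overall pattern follows the abstract: 3D-convolutional encoder stages, lateral projections fused top-down into a pyramid, and a decoder that collapses a clip of frames into a single saliency map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSpatialBlock(nn.Module):
    """A simple 3D-conv block standing in for one encoder stage (assumed, not the paper's backbone)."""
    def __init__(self, in_ch, out_ch, stride=(1, 2, 2)):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))


class TSFPSketch(nn.Module):
    """Hypothetical temporal-spatial feature pyramid: multiscale 3D encoder features,
    top-down integration, and a decoder producing one saliency map per clip."""
    def __init__(self, pyramid_ch=64):
        super().__init__()
        # Encoder: progressively downsample space (and later time).
        self.stage1 = TemporalSpatialBlock(3, 32, stride=(1, 2, 2))
        self.stage2 = TemporalSpatialBlock(32, 64, stride=(2, 2, 2))
        self.stage3 = TemporalSpatialBlock(64, 128, stride=(2, 2, 2))
        # 1x1x1 lateral convs project every stage to a common channel width.
        self.lat1 = nn.Conv3d(32, pyramid_ch, kernel_size=1)
        self.lat2 = nn.Conv3d(64, pyramid_ch, kernel_size=1)
        self.lat3 = nn.Conv3d(128, pyramid_ch, kernel_size=1)
        # Decoder: fuse the finest pyramid level and squeeze the temporal axis.
        self.fuse = nn.Conv3d(pyramid_ch, pyramid_ch, kernel_size=3, padding=1)
        self.head = nn.Conv2d(pyramid_ch, 1, kernel_size=1)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W) stack of consecutive frames
        c1 = self.stage1(clip)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        # Top-down integration: upsample coarser levels and add lateral projections.
        p3 = self.lat3(c3)
        p2 = self.lat2(c2) + F.interpolate(p3, size=c2.shape[2:], mode="trilinear", align_corners=False)
        p1 = self.lat1(c1) + F.interpolate(p2, size=c1.shape[2:], mode="trilinear", align_corners=False)
        fused = F.relu(self.fuse(p1))
        # Collapse the temporal axis, then predict one saliency map for the clip.
        pooled = fused.mean(dim=2)                       # (batch, C, H/2, W/2)
        sal = self.head(pooled)
        sal = F.interpolate(sal, size=clip.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(sal)                        # (batch, 1, H, W)


if __name__ == "__main__":
    model = TSFPSketch()
    frames = torch.randn(1, 3, 16, 96, 128)              # a 16-frame clip
    print(model(frames).shape)                            # torch.Size([1, 1, 96, 128])
```

The sketch returns one full-resolution map per input clip, mirroring the idea of predicting saliency from the integration of multiple video frames; the actual decoder design, loss, and training details are described in the paper itself.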
Data Availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Funding
This study was funded by the National Natural Science Foundation of China (NSFC) under grants No. 61375025, 61075011, and 60675018 and the Scientific Research Foundation for the Returned Overseas Chinese Scholars from the State Education Ministry of China.
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chang, Q., Zhu, S. Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection. Cogn Comput 15, 856–868 (2023). https://doi.org/10.1007/s12559-023-10114-x