DOI: 10.1145/3469877.3490597

Video Saliency Prediction via Deep Eye Movement Learning

Published: 10 January 2022

Abstract

Existing methods typically exploit temporal motion information and spatial layout information in video to predict video saliency. However, fixations do not always coincide with the moving object of interest, because human eye fixations are determined not only by spatio-temporal information but also by the velocity of eye movement. To address this issue, this paper proposes a new saliency prediction method based on deep eye movement learning (EML). Unlike previous methods that use only human fixations as ground truth, our method additionally uses the optical flow of fixations between successive frames as ground truth for eye movement learning. Experimental results on the DHF1K, Hollywood2, and UCF-sports datasets show that the proposed EML model achieves promising results across a wide range of metrics.
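
The abstract only sketches how the extra eye-movement supervision is used. Below is a minimal PyTorch-style sketch of one way such a signal could be combined with standard saliency supervision. The function names (movement_target, eml_loss), the crude frame-difference stand-in for the optical flow of fixations, and the KL-divergence plus L1 loss combination are all illustrative assumptions, not the authors' actual implementation.

    # Minimal sketch: supervise predicted saliency maps with fixation ground truth,
    # and add an auxiliary term that ties the predicted frame-to-frame change to a
    # movement target derived from fixations of successive frames.
    import torch
    import torch.nn.functional as F


    def movement_target(fix_prev: torch.Tensor, fix_next: torch.Tensor) -> torch.Tensor:
        # Hypothetical eye-movement ground truth: the per-pixel change of the
        # fixation density between two successive frames (a simple stand-in for
        # the optical flow of fixations described in the abstract).
        return fix_next - fix_prev


    def eml_loss(sal_prev, sal_next, fix_prev, fix_next, lam: float = 0.5):
        # Saliency term: KL divergence between predicted and ground-truth
        # saliency distributions for each frame.
        def kl(pred, gt):
            pred = pred.flatten(1).softmax(dim=1)
            gt = gt.flatten(1)
            gt = gt / (gt.sum(dim=1, keepdim=True) + 1e-8)
            return F.kl_div(pred.log(), gt, reduction="batchmean")

        saliency_term = kl(sal_prev, fix_prev) + kl(sal_next, fix_next)
        # Movement term: L1 distance between the predicted change and the
        # fixation-derived movement target.
        movement_term = F.l1_loss(sal_next - sal_prev,
                                  movement_target(fix_prev, fix_next))
        return saliency_term + lam * movement_term


    if __name__ == "__main__":
        # Toy usage with random tensors standing in for model outputs and fixation maps.
        b, h, w = 2, 64, 64
        sal_prev, sal_next = torch.rand(b, h, w), torch.rand(b, h, w)
        fix_prev, fix_next = torch.rand(b, h, w), torch.rand(b, h, w)
        print(eml_loss(sal_prev, sal_next, fix_prev, fix_next))

In practice the movement target would presumably be computed once from each dataset's eye-tracking data and cached alongside the fixation maps, with the weighting factor lam tuned on a validation split.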


Cited By

  • Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction. IEEE Transactions on Circuits and Systems for Video Technology 33, 12 (2023), 7696-7707. DOI: 10.1109/TCSVT.2023.3278410. Online publication date: 22 May 2023.
  • GFNet: gated fusion network for video saliency prediction. Applied Intelligence 53, 22 (2023), 27865-27875. DOI: 10.1007/s10489-023-04861-5. Online publication date: 19 September 2023.



        Published In

        MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
        December 2021
        508 pages
        ISBN:9781450386074
        DOI:10.1145/3469877
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. deep learning
        2. eye fixation
        3. eye movement
        4. video saliency

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • National Key Research and Development Program of China
        • Research Programme on Applied Fundamentals and Frontier Technologies of Wuhan
        • Natural Science Foundation of China
        • Beijing Nova Program

        Conference

        MMAsia '21: ACM Multimedia Asia
        December 1 - 3, 2021
        Gold Coast, Australia

        Acceptance Rates

        Overall Acceptance Rate 59 of 204 submissions, 29%

