Abstract
With the proliferation of video data, video summarization is an ideal tool for users to browse video content rapidly. In this paper, we propose a novel foveated convolutional neural network for dynamic video summarization. To our knowledge, this is the first work to integrate gaze information into a deep learning network for video summarization. Foveated images are constructed from subjects’ eye movements to represent the spatial information of the input video. Multi-frame motion vectors are stacked across several adjacent frames to convey motion cues. To evaluate the proposed method, experiments are conducted on two video summarization benchmark datasets. The experimental results validate the effectiveness of gaze information for video summarization, even though the eye movements were collected from subjects other than those who generated the summaries. Empirical validation also demonstrates that the proposed foveated convolutional neural network achieves state-of-the-art performance on these benchmark datasets.
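The two inputs described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the foveation model, so the Gaussian blend between a sharp frame and a box-blurred copy, the single gaze point per frame, and the function names `box_blur`, `foveate`, and `stack_motion_vectors` are all assumptions made for illustration. The channel-wise stacking of per-frame motion fields follows a convention common in two-stream networks and may differ from the paper's exact layout.

```python
import numpy as np


def box_blur(image, radius):
    """Blur a 2-D grayscale image with a (2*radius+1)^2 mean filter, edge-padded."""
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    size = 2 * radius + 1
    h, w = image.shape
    # Sum all shifted copies of the padded image, then divide by the window area.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)


def foveate(image, gaze_xy, sigma):
    """Keep detail near the gaze point and blend toward a blurred copy with distance.

    Assumption: a Gaussian falloff around one (x, y) gaze location.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    dist2 = (xs - gx) ** 2 + (ys - gy) ** 2
    # Weight is 1 at the gaze point and decays toward 0 in the periphery.
    weight = np.exp(-dist2 / (2.0 * sigma ** 2))
    return weight * image + (1.0 - weight) * box_blur(image, radius=2)


def stack_motion_vectors(flows):
    """Stack L per-frame motion fields of shape (H, W, 2) into one (H, W, 2L) array."""
    return np.concatenate(flows, axis=-1)
```

Under these assumptions, the foveated frame would feed a spatial stream and the stacked array a temporal stream; how the two are actually combined is described in the paper body, not here.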
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61502311, No. 61620106008), the Natural Science Foundation of Guangdong Province (No. 2016A030310053), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501, the Shenzhen Emerging Industries of the Strategic Basic Research Project under Grant No. JCYJ20160226191842793, the Shenzhen high-level overseas talents program, and the Tencent “Rhinoceros Birds” - Scientific Research Foundation for Young Teachers of Shenzhen University.
Additional information
Jiaxin Wu and Sheng-hua Zhong contributed equally to this work.
About this article
Cite this article
Wu, J., Zhong, Sh., Ma, Z. et al. Foveated convolutional neural networks for video summarization. Multimed Tools Appl 77, 29245–29267 (2018). https://doi.org/10.1007/s11042-018-5953-1
DOI: https://doi.org/10.1007/s11042-018-5953-1