Hierarchical Recurrent Neural Network for Video Summarization

Published: 19 October 2017

Abstract

Exploiting the temporal dependency among video frames or subshots is essential for video summarization. Recurrent neural networks (RNNs) model temporal dependency well and have achieved strong performance in many video-based tasks, such as video captioning and classification. However, traditional RNNs, including LSTM, can only handle short videos, whereas the videos in summarization tasks are usually much longer. To address this problem, we propose a hierarchical recurrent neural network for video summarization, called H-RNN. It has two layers: the first layer encodes short video subshots cut from the original video, and the final hidden state of each subshot is fed to the second layer, which computes that subshot's confidence of being a key subshot. Compared to traditional RNNs, H-RNN is better suited to video summarization because it can exploit long-range temporal dependency among frames while requiring significantly fewer computation operations. Results on two popular datasets, the Combined dataset and the VTW dataset, demonstrate that the proposed H-RNN outperforms state-of-the-art methods.
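The two-layer structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the plain tanh recurrent cell (standing in for whatever recurrent unit the paper uses), the fixed-length subshot splitting, the sigmoid readout, the random untrained weights, and every name below (`TanhRNN`, `split_into_subshots`, `score_subshots`, `w_out`) are assumptions made for illustration only.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained; illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TanhRNN:
    """Plain tanh recurrent layer (assumed stand-in for the paper's cell)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.w_in = make_matrix(hidden_size, input_size, rng)
        self.w_rec = make_matrix(hidden_size, hidden_size, rng)

    def run(self, sequence):
        """Return the hidden state after each step, starting from zeros."""
        h = [0.0] * self.hidden_size
        states = []
        for x in sequence:
            from_input = matvec(self.w_in, x)
            from_state = matvec(self.w_rec, h)
            h = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            states.append(h)
        return states


def split_into_subshots(frames, subshot_len):
    """Cut the frame sequence into consecutive fixed-length subshots
    (assumption: the abstract does not say how subshots are cut)."""
    return [frames[i:i + subshot_len] for i in range(0, len(frames), subshot_len)]


def score_subshots(frames, subshot_len, layer1, layer2, w_out):
    """Layer 1 encodes each subshot; layer 2 scores the subshot sequence."""
    subshots = split_into_subshots(frames, subshot_len)
    # First layer: encode every subshot separately, keep only its final state.
    codes = [layer1.run(subshot)[-1] for subshot in subshots]
    # Second layer: recur over the (much shorter) sequence of subshot codes.
    states = layer2.run(codes)
    # Assumed readout: a sigmoid over a linear projection gives a confidence.
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_out, state))))
        for state in states
    ]


rng = random.Random(0)
feat_dim, hidden = 8, 4
# 95 hypothetical frame feature vectors (random placeholders, not real data).
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(95)]
layer1 = TanhRNN(feat_dim, hidden, rng)
layer2 = TanhRNN(hidden, hidden, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
scores = score_subshots(frames, 10, layer1, layer2, w_out)
```

In this sketch the 95 frames yield 10 subshots, so no recurrent chain is longer than 10 steps in either layer, whereas a single flat RNN over the same input would unroll across all 95 frames. That shortening of the recurrence is the property the abstract attributes to the hierarchy.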

Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. hierarchical recurrent neural network
  3. video summarization

Qualifiers

  • Research-article

Conference

MM '17
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 56
  • Downloads (last 6 weeks): 8

Reflects downloads up to 28 Feb 2025

Cited By

  • (2025) Spatial–temporal multi-scale interaction for few-shot video summarization. Engineering Applications of Artificial Intelligence 142 (109883). DOI: 10.1016/j.engappai.2024.109883. Online publication date: Feb-2025
  • (2024) Explainable Video Summarization for Advancing Media Content Production. Encyclopedia of Information Science and Technology, Sixth Edition (1-24). DOI: 10.4018/978-1-6684-7366-5.ch065. Online publication date: 1-Jul-2024
  • (2024) Effective Video Summarization Using Channel Attention-Assisted Encoder–Decoder Framework. Symmetry 16:6 (680). DOI: 10.3390/sym16060680. Online publication date: 1-Jun-2024
  • (2024) Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization. Journal of Imaging 10:9 (229). DOI: 10.3390/jimaging10090229. Online publication date: 14-Sep-2024
  • (2024) FastPerson: Enhancing Video-Based Learning through Video Summarization that Preserves Linguistic and Visual Contexts. Proceedings of the Augmented Humans International Conference 2024 (205-216). DOI: 10.1145/3652920.3652922. Online publication date: 4-Apr-2024
  • (2024) Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models. Proceedings of the 2024 International Conference on Multimedia Retrieval (978-987). DOI: 10.1145/3652583.3658032. Online publication date: 30-May-2024
  • (2024) Multi-modal Video Summarization. Proceedings of the 2024 International Conference on Multimedia Retrieval (1214-1218). DOI: 10.1145/3652583.3657582. Online publication date: 30-May-2024
  • (2024) VideoXum: Cross-Modal Visual and Textural Summarization of Videos. IEEE Transactions on Multimedia 26 (5548-5560). DOI: 10.1109/TMM.2023.3335875. Online publication date: 2024
  • (2024) Locating X-Ray Coronary Angiogram Keyframes via Long Short-Term Spatiotemporal Attention With Image-to-Patch Contrastive Learning. IEEE Transactions on Medical Imaging 43:1 (51-63). DOI: 10.1109/TMI.2023.3286859. Online publication date: Jan-2024
  • (2024) Multi-Reference Evaluation of Dynamic Video Summaries Using Granule-Aware F-Measure. IEEE Transactions on Emerging Topics in Computational Intelligence 8:4 (3040-3054). DOI: 10.1109/TETCI.2024.3369855. Online publication date: Aug-2024
