research-article

Hierarchical Recurrent Neural Network for Video Summarization

Authors:

Xiaoqiang Lu

MM '17: Proceedings of the 25th ACM international conference on Multimedia

Pages 863 - 871

https://doi.org/10.1145/3123266.3123328

Published: 19 October 2017

Abstract

Exploiting the temporal dependency among video frames or subshots is essential for video summarization. Recurrent neural networks (RNNs) are well suited to modeling temporal dependency and have achieved strong performance in many video-based tasks, such as video captioning and classification. However, traditional RNNs, including LSTM, can only handle short videos, whereas the videos in summarization tasks are usually much longer. To address this problem, we propose a hierarchical recurrent neural network for video summarization, called H-RNN. It has two layers: the first layer encodes short video subshots cut from the original video, and the final hidden state of each subshot is fed to the second layer, which calculates the confidence that the subshot is a key subshot. Compared with traditional RNNs, H-RNN is better suited to video summarization because it can exploit long temporal dependency among frames while requiring significantly fewer computation operations. Results on two popular datasets, the Combined dataset and the VTW dataset, demonstrate that the proposed H-RNN outperforms state-of-the-art methods.
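The two-layer structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain tanh recurrent cell as a stand-in for the recurrent layers, fixed-length subshots, random untrained weights, and random vectors in place of real frame features. All names (`TanhRNN`, `split_into_subshots`, `hierarchical_scores`, `w_out`) are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TanhRNN:
    """Plain tanh recurrent cell; an assumed stand-in for the paper's layers."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.w_in = make_matrix(hidden_size, input_size, rng)
        self.w_rec = make_matrix(hidden_size, hidden_size, rng)

    def run(self, inputs):
        """Return the hidden state after every time step."""
        h = [0.0] * self.hidden_size
        states = []
        for x in inputs:
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
            states.append(h)
        return states


def split_into_subshots(frames, subshot_len):
    """Cut the frame sequence into consecutive fixed-length subshots."""
    return [frames[i:i + subshot_len] for i in range(0, len(frames), subshot_len)]


def hierarchical_scores(frames, subshot_len, layer1, layer2, w_out):
    """Score each subshot with a two-layer hierarchical recurrent model."""
    subshots = split_into_subshots(frames, subshot_len)
    # Layer 1: encode each subshot on its own and keep its final hidden state.
    codes = [layer1.run(subshot)[-1] for subshot in subshots]
    # Layer 2: run over the sequence of subshot codes.
    states = layer2.run(codes)
    # Sigmoid of a linear readout gives a confidence in (0, 1) per subshot.
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_out, state))))
        for state in states
    ]


rng = random.Random(0)
feat_dim, hidden, subshot_len = 8, 4, 20
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(100)]
layer1 = TanhRNN(feat_dim, hidden, rng)
layer2 = TanhRNN(hidden, hidden, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]

scores = hierarchical_scores(frames, subshot_len, layer1, layer2, w_out)
# Longest recurrent chain: one subshot in layer 1 plus all subshots in layer 2.
longest_chain = subshot_len + len(scores)
print(len(scores), longest_chain)  # prints "5 25"
```

In this toy setting, 100 frames are cut into 5 subshots of 20 frames, so no recurrent chain exceeds 25 steps, compared with 100 steps for a single flat RNN over all frames. That is one way to read the abstract's claim about handling longer videos; the paper's actual subshot length and layer types are not given in the abstract.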



Index Terms

Hierarchical Recurrent Neural Network for Video Summarization
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
2. Networks
  1. Network architectures
    1. Network design principles
      1. Layering
  2. Network properties
    1. Network structure
      1. Network topology types


Published In


MM '17: Proceedings of the 25th ACM international conference on Multimedia

October 2017

2028 pages

ISBN:9781450349062

DOI:10.1145/3123266

General Chairs:
Qiong Liu, FXPAL, USA
Rainer Lienhart, Universität Augsburg, Germany
Haohong Wang, TCL America, USA

Program Chairs:
Sheng-Wei "Kuan-Ta" Chen, Academia Sinica, Taiwan
Susanne Boll, University of Oldenburg, Germany
Phoebe Chen, La Trobe University, Australia
Gerald Friedland, Lawrence Livermore National Lab, USA
Jia Li, Google, USA
Shuicheng Yan, Qihoo 360, China

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

MM '17

Sponsor:

SIGMM

MM '17: ACM Multimedia Conference

October 23 - 27, 2017

Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


