A Top-Down Approach for Video Summarization

Published: 04 September 2014

Abstract

While most existing video summarization approaches aim to identify important frames of a video from either a global or a local perspective, we propose a top-down approach consisting of scene identification and scene summarization. For scene identification, we represent each frame with global features and apply a scalable clustering method. We then formulate scene summarization as choosing the frames that best cover a set of local descriptors with minimal redundancy. In addition, we develop a visual word-based method to make the approach more computationally scalable. Experimental results on two benchmark datasets demonstrate that the proposed approach clearly outperforms the state of the art.
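
To make the coverage idea in the abstract concrete, the following is a minimal sketch of one way to pick frames that cover a set of visual words with little redundancy. It is an illustration under stated assumptions, not the authors' formulation: the greedy strategy, the function name `greedy_cover`, the `max_frames` budget, and the toy `scene` data are all hypothetical, and each frame is assumed to be already reduced to a set of visual-word IDs.

```python
from typing import Dict, List, Set


def greedy_cover(frame_words: Dict[int, Set[int]], max_frames: int) -> List[int]:
    """Greedily pick frames so that each pick adds the most uncovered visual words.

    frame_words maps a frame index to the set of visual-word IDs observed in
    that frame (hypothetical input; the paper's descriptors are not shown here).
    """
    uncovered: Set[int] = set().union(*frame_words.values()) if frame_words else set()
    selected: List[int] = []
    while uncovered and len(selected) < max_frames:
        # Gain is the number of still-uncovered words a frame would add.
        # Ties go to the lower frame index so the result is deterministic.
        best = max(
            (f for f in frame_words if f not in selected),
            key=lambda f: (len(frame_words[f] & uncovered), -f),
            default=None,
        )
        if best is None or not (frame_words[best] & uncovered):
            break  # every remaining frame is fully redundant
        selected.append(best)
        uncovered -= frame_words[best]
    return selected


# Toy scene: four frames described by small sets of visual-word IDs.
scene = {0: {1, 2, 3}, 1: {2, 3}, 2: {4, 5}, 3: {1, 4}}
print(greedy_cover(scene, max_frames=3))
```

In this toy scene the sketch selects frames 0 and 2: together they cover all five words, so frames 1 and 3 add nothing new and are skipped. A greedy pass is only one heuristic for a coverage objective; the paper's actual optimization may differ.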

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 11, Issue 1
August 2014
151 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/2665935
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2014
Accepted: 01 April 2014
Revised: 01 January 2014
Received: 01 October 2013
Published in TOMM Volume 11, Issue 1

Author Tags

  1. Keyframe extraction
  2. clustering
  3. keypoint
  4. local visual word
  5. scene identification

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Australian Research Council
  • Fundamental Research Funds for the Central Universities (3102014JCQ01054)
  • Natural Science Foundation of Shaanxi Province
  • National ICT Australia (NICTA)

Cited By

  • (2024) Effective Video Summarization by Extracting Parameter-Free Motion Attention. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-20). DOI: 10.1145/3654670. Online publication date: 30-Mar-2024
  • (2024) Collaborative Multi-Agent Video Fast-Forwarding. IEEE Transactions on Multimedia 26 (1041-1054). DOI: 10.1109/TMM.2023.3275853. Online publication date: 2024
  • (2024) Automatic Generation of Interactive Nonlinear Video for Online Apparel Shopping Navigation. IEEE Transactions on Multimedia 26 (474-486). DOI: 10.1109/TMM.2023.3266615. Online publication date: 1-Jan-2024
  • (2023) Unsupervised video summarization using deep Non-Local video summarization networks. Neurocomputing 519:C (26-35). DOI: 10.1016/j.neucom.2022.11.028. Online publication date: 28-Jan-2023
  • (2023) Multi-scale deep feature fusion based sparse dictionary selection for video summarization. Signal Processing: Image Communication 118 (117006). DOI: 10.1016/j.image.2023.117006. Online publication date: Oct-2023
  • (2023) Towards machine vision-based video analysis in smart cities: a survey, framework, applications and open issues. Multimedia Tools and Applications 83:22 (62107-62158). DOI: 10.1007/s11042-023-16434-2. Online publication date: 9-Aug-2023
  • (2023) Data-driven enabled approaches for criteria-based video summarization: a comprehensive survey, taxonomy, and future directions. Multimedia Tools and Applications 82:21 (32635-32709). DOI: 10.1007/s11042-023-14925-w. Online publication date: 2-Mar-2023
  • (2022) An improved tube rearrangement strategy for choice-based surveillance video synopsis generation. Digital Signal Processing 132:C. DOI: 10.1016/j.dsp.2022.103817. Online publication date: 1-Dec-2022
  • (2022) VSMCNN-dynamic summarization of videos using salient features from multi-CNN model. Journal of Ambient Intelligence and Humanized Computing 14:10 (14071-14080). DOI: 10.1007/s12652-022-04112-4. Online publication date: 25-Jun-2022
  • (2021) Residual Refinement Network with Attribute Guidance for Precise Saliency Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 17:3 (1-19). DOI: 10.1145/3440694. Online publication date: 22-Jul-2021
