Abstract
Generic video summarization algorithms produce a single, fixed summary for each video, which cannot satisfy the differing requirements that different users have for the same video. This paper addresses query-based video summarization, which takes a user’s query and a long video as input and generates a summary tailored to that query. We propose a query-based video summarization algorithm built on a multi-label classification network (MLC-SUM). Specifically, we treat video summarization as a target-based multi-label classification problem: convolutional features are fed into a multi-layer perceptron to predict the correlation between video content and multiple concept labels, and the cross-correlation of the labels is then used to weight the predicted probabilities. Finally, we select the video content most relevant to the user’s query sentence as the output summary. Experiments on three common datasets verify the effectiveness and superiority of the proposed algorithm.
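The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The single hidden layer, the row-normalized correlation weighting, the mean-over-query-concepts relevance score, and all function and parameter names below are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mlp_concept_probs(features, w1, b1, w2, b2):
    """Hypothetical one-hidden-layer MLP with independent sigmoid outputs.

    features: (n_shots, d) array of per-shot convolutional features.
    Returns an (n_shots, n_concepts) array of per-concept probabilities.
    """
    hidden = np.maximum(0.0, features @ w1 + b1)  # ReLU hidden layer
    return sigmoid(hidden @ w2 + b2)              # one sigmoid per concept label


def reweight_by_label_correlation(probs, corr):
    """Assumed weighting scheme: mix each concept's probability with the
    probabilities of correlated concepts, using row-normalized weights."""
    weights = corr / corr.sum(axis=1, keepdims=True)  # each row sums to 1
    return probs @ weights.T


def select_summary(probs, query_concepts, budget):
    """Keep the `budget` shots whose mean probability over the queried
    concepts is highest, returned in temporal order."""
    relevance = probs[:, query_concepts].mean(axis=1)
    top = np.argsort(-relevance, kind="stable")[:budget]
    return np.sort(top)
```

In this sketch a query is represented as a list of concept indices, and `budget` is the number of shots kept in the summary. Both are illustrative choices rather than details stated in the abstract.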
Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Acknowledgments
This work is supported by the National Key Research and Development Plan (2020YFC0832600).
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest regarding this publication.
About this article
Cite this article
Hu, W., Zhang, Y., Li, Y. et al. Query-based video summarization with multi-label classification network. Multimed Tools Appl 82, 37529–37549 (2023). https://doi.org/10.1007/s11042-023-15126-1
DOI: https://doi.org/10.1007/s11042-023-15126-1