Abstract
In user-created videos, content changes constantly between neighboring frames, which poses an additional challenge for existing video summarization methods. If the critical features of the frames are refined, keyframe selection, the key step in video summarization, can reach promising accuracy. In this work, we propose a spatiotemporal two-stream LSTM network (ST-LSTM) model that enhances the critical features of the frames by combining spatial saliency with temporal semantic dependencies; we refer to this combination as the two-stream method. Motivated by the fact that sizable and moving objects attract more visual attention, we design a saliency-area-based attention network to filter out irrelevant, non-attractive information. We use a recent attention-based Bi-LSTM network to extract the temporal dependencies among the semantic features. Furthermore, we present a multi-feature reward function that reinforces the ST-LSTM model by integrating diversity, representativeness, and storyness. Finally, we adopt the Deep Deterministic Policy Gradient (DDPG) algorithm to train the proposed method without supervision. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods.
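To make the idea of a multi-feature reward more concrete, the following is a minimal sketch of how a diversity term and a representativeness term could be combined into one scalar reward for a candidate set of keyframes. It follows a formulation that is common in unsupervised summarization work, not necessarily the exact reward used here. The function names, the equal weights `w_div` and `w_rep`, and the toy feature vectors are all assumptions made for illustration, and the storyness term is left out because the abstract does not define it.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_reward(features, selected):
    """Mean pairwise dissimilarity among the selected keyframes."""
    n = len(selected)
    if n < 2:
        return 0.0  # diversity is undefined for fewer than two keyframes
    total = sum(
        cosine_dissimilarity(features[i], features[j])
        for i in selected
        for j in selected
        if i != j
    )
    return total / (n * (n - 1))


def representativeness_reward(features, selected):
    """exp(-mean Euclidean distance from each frame to its nearest keyframe)."""
    if not selected:
        return 0.0
    mean_dist = sum(
        min(math.dist(f, features[k]) for k in selected) for f in features
    ) / len(features)
    return math.exp(-mean_dist)


def summary_reward(features, selected, w_div=0.5, w_rep=0.5):
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (w_div * diversity_reward(features, selected)
            + w_rep * representativeness_reward(features, selected))
```

With four toy frames `[[1, 0], [0, 1], [1, 0], [0, 1]]`, selecting frames 0 and 1 covers both distinct contents and scores higher than selecting the two identical frames 0 and 2, which is the behavior such a reward is meant to encourage.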
About this article
Cite this article
Hu, M., Hu, R., Wang, Z. et al. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimed Tools Appl 81, 40489–40510 (2022). https://doi.org/10.1007/s11042-022-12901-4
DOI: https://doi.org/10.1007/s11042-022-12901-4