Spatiotemporal two-stream LSTM network for unsupervised video summarization

Hu, Min; Hu, Ruimin; Wang, Zhongyuan; Xiong, Zixiang; Zhong, Rui

doi:10.1007/s11042-022-12901-4

Published: 10 May 2022

Volume 81, pages 40489–40510, (2022)

Multimedia Tools and Applications

Min Hu ORCID: orcid.org/0000-0003-1151-152X¹,
Ruimin Hu¹,
Zhongyuan Wang¹,
Zixiang Xiong² &
…
Rui Zhong³


Abstract

In user-created videos, the constantly changing content of neighboring frames poses additional challenges for prior video summarization methods. If the critical features of each frame are refined, keyframes, whose selection is central to video summarization, can be chosen more accurately. In this work, we propose a Spatiotemporal two-stream LSTM network-based (ST-LSTM) model that enhances the frames' critical features by combining spatial saliency with temporal semantic dependencies; we refer to this combination as the two-stream method. Motivated by the fact that sizable and moving objects attract more visual attention, we design a Saliency-area-based attention network to filter out irrelevant, non-attractive information. We use a recent attention-based Bi-LSTM network to extract the temporal dependencies among the semantic features. Furthermore, a multi-feature-based reward function is presented to reinforce the ST-LSTM model by integrating diversity, representativeness, and storyness. Finally, the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted to train the proposed method without supervision. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods.
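The abstract names the reward components but does not define them. As a rough illustration only, the sketch below uses one common formulation from the unsupervised summarization literature: diversity as the mean pairwise cosine dissimilarity among selected frames, and representativeness as the exponential of the negative mean distance from every frame to its nearest selected frame. These definitions, the function names, and the weights `w_div` and `w_rep` are assumptions rather than the paper's method, and the storyness term is omitted because nothing here specifies it.

```python
import numpy as np


def diversity_reward(features, selected):
    """Mean pairwise cosine dissimilarity among the selected frames (assumed form)."""
    idx = np.asarray(selected, dtype=int)
    n = idx.size
    if n < 2:
        return 0.0  # dissimilarity is undefined for fewer than two frames
    x = features[idx]
    # Normalise rows; the floor avoids division by zero for all-zero features.
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    sim = x @ x.T
    off_diag = sim.sum() - np.trace(sim)  # drop self-similarities
    return float(1.0 - off_diag / (n * (n - 1)))


def representativeness_reward(features, selected):
    """exp(-mean distance from each frame to its nearest selected frame) (assumed form)."""
    idx = np.asarray(selected, dtype=int)
    if idx.size == 0:
        return 0.0
    # Pairwise Euclidean distances, shape (num_frames, num_selected).
    diffs = features[:, None, :] - features[idx][None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return float(np.exp(-dists.min(axis=1).mean()))


def combined_reward(features, selected, w_div=0.5, w_rep=0.5):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_div * diversity_reward(features, selected)
            + w_rep * representativeness_reward(features, selected))
```

Here `features` is a `(num_frames, feature_dim)` array of per-frame descriptors and `selected` is a list of keyframe indices. In a reinforcement-learning setup such as the one the abstract describes, a scalar of this kind would be returned to the agent after it picks a summary.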



Author information

Authors and Affiliations

Wuhan University, Hubei, China
Min Hu, Ruimin Hu & Zhongyuan Wang
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
Zixiang Xiong
Central China Normal University, Wuhan, China
Rui Zhong


Corresponding author

Correspondence to Min Hu.


About this article

Cite this article

Hu, M., Hu, R., Wang, Z. et al. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimed Tools Appl 81, 40489–40510 (2022). https://doi.org/10.1007/s11042-022-12901-4


Received: 14 November 2020
Revised: 26 February 2021
Accepted: 09 March 2022
Published: 10 May 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11042-022-12901-4
