
Spatiotemporal two-stream LSTM network for unsupervised video summarization


Abstract

In user-created videos, the content changes rapidly between neighboring frames, which challenges prior video summarization methods. If the frames' critical features are well refined, keyframe selection, the core step of video summarization, becomes substantially more accurate. In this work, we propose a spatiotemporal two-stream LSTM (ST-LSTM) model that enhances these critical features by combining spatial saliency with temporal semantic dependencies, hence the name "two-stream". Motivated by the observation that large and moving objects attract the most visual attention, we design a saliency-area-based attention network that filters out irrelevant, non-salient information, and we use an attention-based Bi-LSTM network to capture the temporal dependencies among the semantic features. Furthermore, a multi-feature reward function is presented that reinforces the ST-LSTM model by integrating diversity, representativeness, and storyness, and the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted to train the model without supervision. Extensive experiments on public datasets demonstrate that our method outperforms the state of the art.
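To make the reward design concrete, the sketch below is a minimal, hypothetical Python/PyTorch implementation of a multi-feature reward of the kind the abstract describes; it is not the authors' code. The diversity and representativeness terms follow formulations common in reinforcement-learning summarizers, while the storyness term is a placeholder proxy (an even temporal spread of keyframes), since the paper's exact definition is not reproduced here. The function name summary_reward and all implementation details are invented for illustration.

    import torch
    import torch.nn.functional as F

    def summary_reward(features: torch.Tensor, picks: torch.Tensor) -> torch.Tensor:
        """Score a keyframe selection.

        features: (T, D) per-frame embeddings (e.g. CNN feature vectors).
        picks:    (T,) binary mask, 1 where a frame is kept in the summary.
        """
        idx = picks.nonzero(as_tuple=True)[0]
        if idx.numel() < 2:
            return torch.tensor(0.0)               # degenerate summary earns no reward
        f = F.normalize(features, dim=1)
        sel = f[idx]                               # (K, D) selected frames
        k = idx.numel()

        # Diversity: mean pairwise dissimilarity among the selected frames.
        sim = sel @ sel.t()                        # cosine similarities (K, K)
        diss = (1.0 - sim) * (1.0 - torch.eye(k))  # zero out the diagonal
        r_div = diss.sum() / (k * (k - 1))

        # Representativeness: every frame should lie near some selected frame.
        d = torch.cdist(f, sel)                    # (T, K) pairwise distances
        r_rep = torch.exp(-d.min(dim=1).values.mean())

        # Storyness (placeholder proxy): prefer keyframes spread evenly over
        # time, so the summary follows the video from start to end.
        gaps = idx.float().diff() / features.size(0)
        r_story = torch.exp(-gaps.std(unbiased=False))

        return r_div + r_rep + r_story

In a DDPG-style training loop, a scalar reward of this kind would score each candidate summary produced by the actor network; the critic estimates the expected reward, and its gradient drives the unsupervised updates.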



Author information

Correspondence to Min Hu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hu, M., Hu, R., Wang, Z. et al. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimed Tools Appl 81, 40489–40510 (2022). https://doi.org/10.1007/s11042-022-12901-4

