Abstract
To summarize a video properly, it is important to grasp both its sequential structure and the long-term dependencies between frames. This need is especially evident in unsupervised learning. One possible solution is to use a well-known technique from natural language processing for modeling long-term dependency and sequential order: self-attention with relative position embedding (RPE). However, compared to natural language processing, video summarization requires capturing a much longer global context. In this paper, we therefore present a novel input decomposition strategy, which samples the input both globally and locally. This provides an effective temporal window for RPE to operate on and significantly improves overall computational efficiency. Combining Global-and-Local input decomposition with RPE yields GL-RPE. Our approach allows the network to capture both local and global interdependencies between video frames effectively. Since GL-RPE can be easily integrated into existing methods, we apply it to two different unsupervised backbones. We provide extensive ablation studies and visual analysis to verify the effectiveness of the proposals. We demonstrate that our approach achieves new state-of-the-art performance on the recently proposed rank-order-based metrics: Kendall’s \(\tau \) and Spearman’s \(\rho \). Furthermore, although our method is unsupervised, we show that it performs on par with the fully supervised method.
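The abstract does not spell out how the input is sampled, so the following is only a minimal sketch of one plausible reading of a global-and-local decomposition, not the authors' implementation. It assumes that the local view groups neighbouring frames into contiguous chunks and that the global view groups frames picked at a fixed stride across the whole video. The names `local_decompose`, `global_decompose`, `attention_pairs`, `window`, and `stride` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def local_decompose(frames: Sequence[T], window: int) -> List[List[T]]:
    """Split frames into contiguous chunks of at most `window` frames.

    Assumed reading of the "local" view: each chunk holds neighbouring frames.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(frames[i:i + window]) for i in range(0, len(frames), window)]


def global_decompose(frames: Sequence[T], stride: int) -> List[List[T]]:
    """Split frames into `stride` groups by strided sampling.

    Assumed reading of the "global" view: group g holds frames
    g, g + stride, g + 2 * stride, ..., so each group spans the whole video.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [list(frames[g::stride]) for g in range(stride)]


def attention_pairs(groups: List[List[T]]) -> int:
    """Count query-key pairs if self-attention runs inside each group only."""
    return sum(len(group) ** 2 for group in groups)


# Toy video of 12 frame indices (hypothetical sizes, for illustration only).
frames = list(range(12))
local_groups = local_decompose(frames, window=4)
global_groups = global_decompose(frames, stride=3)

full_cost = len(frames) ** 2
split_cost = attention_pairs(local_groups) + attention_pairs(global_groups)

print("local groups: ", local_groups)
print("global groups:", global_groups)
print("attention pairs, full sequence:", full_cost)
print("attention pairs, local + global:", split_cost)
```

Under these assumptions, every group is short enough for relative positions to stay within a small window, and the pair count for attention drops because each frame attends only within its own local and global group rather than over the full sequence.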
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Jung, Y., Cho, D., Woo, S., Kweon, I.S. (2020). Global-and-Local Relative Position Embedding for Unsupervised Video Summarization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12370. Springer, Cham. https://doi.org/10.1007/978-3-030-58595-2_11
DOI: https://doi.org/10.1007/978-3-030-58595-2_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58594-5
Online ISBN: 978-3-030-58595-2
eBook Packages: Computer Science, Computer Science (R0)