Abstract
In daily life, we routinely predict how the objects around us will move in the near future. But how can a machine make such predictions? In this paper, we propose a GAN-based network that predicts the near future for fluid scene domains such as clouds and beaches. Given a single frame, our model predicts the subsequent frames. Inspired by the self-attention mechanism [25], we introduce a spatial self-attention mechanism into the model. The self-attention mechanism computes the response at a position as a weighted sum of the features at all positions, which allows the model to be trained efficiently in a single stage. Experiments show that our model is comparable to the state-of-the-art method, which requires two-stage learning.
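The core operation described above, a response at each position computed as a softmax-weighted sum of the features at all spatial positions, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection matrices `wq`, `wk`, `wv` are hypothetical stand-ins for the learned 1×1 convolutions used in attention modules such as [25, 38].

```python
import numpy as np

def spatial_self_attention(x, wq, wk, wv):
    """x: (N, C) feature map flattened to N = H*W positions with C channels.
    Returns an (N, C) map where each row is a weighted sum of all positions."""
    q, k, v = x @ wq, x @ wk, x @ wv      # query/key/value projections
    scores = q @ k.T                      # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over all positions
    return attn @ v                       # weighted sum of features at all positions

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
x = rng.standard_normal((H * W, C))
wq, wk, wv = (rng.standard_normal((C, C)) for _ in range(3))
out = spatial_self_attention(x, wq, wk, wv)
print(out.shape)  # → (16, 8)
```

Because every output position attends to every input position, a single attention layer captures long-range dependencies that would otherwise require stacked convolutions, which is what makes one-stage training of the generator plausible here.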
References
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML) (2017)
Cai, H., Bai, C., Tai, Y., Tang, C.: Deep video generation, prediction and completion of human action sequences. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Cheng, X., Dale, C., Liu, J.: Understanding the characteristics of internet short video sharing: YouTube as a case study. In: IEEE International Symposium on Multimedia (ISM) (2007)
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: Proceedings of Neural Information Processing Systems (2016)
Goodfellow, I., et al.: Generative adversarial nets. In: Proceedings of Neural Information Processing Systems (2014)
Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Proceedings of the Neural Information Processing Systems (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML) (2015)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2017)
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: Proceedings of the International Conference on Learning Representation (ICLR) (2018)
Krishna, R., Hata, K., Ren, F., Li, F., Niebles, J.C.: Dense-captioning events in videos. In: Proceedings of IEEE International Conference on Computer Vision (ICCV) (2017)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.: Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of European Conference on Computer Vision (ECCV) (2018)
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)
Ohnishi, K., Yamamoto, S., Ushiku, Y., Harada, T.: Hierarchical video generation from orthogonal information: optical flow and texture. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2018)
Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)
Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)
Saito, M., Matsumoto, E.: Temporal generative adversarial nets. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
Shou, Z., Wang, D., Chang, S.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2016)
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning (ICML) (2015)
Tulyakov, S., Liu, M., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)
Vaswani, A., et al.: Attention is all you need. In: Proceedings of Neural Information Processing Systems (2017)
Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Proceedings of the Neural Information Processing Systems (2016)
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2013)
Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2018)
Xiong, W., Luo, W., Ma, L., Liu, W., Luo, J.: Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)
Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., Lin, D.: Pose guided human video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Zhang, H., Goodfellow, I.J., Metaxas, D.N., Odena, A.: Self-attention generative adversarial networks. arXiv:1805.08318 (2018)
Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Learning to forecast and refine residual motion for image-to-video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Zhou, Y., Berg, T.L.: Learning temporal transformations from time-lapse videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 262–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_16
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers 15H05915, 17H01745, 17H06100 and 19H04929.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Horita, D., Yanai, K. (2020). SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science(), vol 12046. Springer, Cham. https://doi.org/10.1007/978-3-030-41404-7_44
Print ISBN: 978-3-030-41403-0
Online ISBN: 978-3-030-41404-7