SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12046)

Abstract

In our daily lives, we routinely predict how the objects around us will move in the near future. But how can such predictions be made computationally? In this paper, we address this problem with a GAN-based network that predicts the near future for fluid object domains such as cloud and beach scenes. Our model takes a single frame as input and predicts the subsequent frames. Inspired by the self-attention mechanism [25], we introduce a spatial self-attention mechanism into the model. This mechanism computes the response at each position as a weighted sum of the features at all positions, which allows the model to be trained efficiently in a single stage. In our experiments, we show that our model is comparable to the state-of-the-art method, which requires two-stage learning.
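As a concrete illustration of the spatial self-attention described above, the following is a minimal sketch of such a layer in PyTorch, in the spirit of self-attention GANs [31] and non-local networks [28]. The module name, the channel reduction factor, and its placement in the generator are illustrative assumptions; this is not the authors' exact architecture.

```python
# Minimal sketch of a spatial self-attention layer (SAGAN-style [31]).
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSelfAttention(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        reduced = max(in_channels // 8, 1)
        # 1x1 convolutions produce query/key/value maps from the input features.
        self.query = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # Learned scale so the block initially behaves as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)   # (b, n, c')
        k = self.key(x).view(b, -1, n)                       # (b, c', n)
        v = self.value(x).view(b, c, n)                      # (b, c, n)
        # Attention weights: similarity of every position with every other position.
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (b, n, n)
        # Response at each position = weighted sum of the features at all positions.
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x
```

In a generator, such a block would typically sit between convolutional layers on mid-resolution feature maps, where the attention map over all n = h x w positions is still affordable.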

Notes

  1. https://sites.google.com/site/whluoimperial/mdgan.

  2. http://www.cs.columbia.edu/~vondrick/tinyvideo/.

References

  1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML) (2017)

  2. Cai, H., Bai, C., Tai, Y., Tang, C.: Deep video generation, prediction and completion of human action sequences. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  3. Cheng, X., Dale, C., Liu, J.: Understanding the characteristics of internet short video sharing: YouTube as a case study. In: IEEE International Symposium on Multimedia (ISM) (2007)

  4. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49

  5. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: Proceedings of Neural Information Processing Systems (2016)

  6. Goodfellow, I., et al.: Generative adversarial nets. In: Proceedings of Neural Information Processing Systems (2014)

  7. Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  8. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Proceedings of the Neural Information Processing Systems (2018)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016)

  10. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML) (2015)

  11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2017)

  12. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013)

  13. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: Proceedings of the International Conference on Learning Representation (ICLR) (2017)

  14. Krishna, R., Hata, K., Ren, F., Li, F., Niebles, J.C.: Dense-captioning events in videos. In: Proceedings of IEEE International Conference on Computer Vision (ICCV) (2017)

  15. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.: Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of European Conference on Computer Vision (ECCV) (2018)

  16. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)

  17. Ohnishi, K., Yamamoto, S., Ushiku, Y., Harada, T.: Hierarchical video generation from orthogonal information: optical flow and texture. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2018)

  18. Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  19. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  20. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)

  21. Saito, M., Matsumoto, E.: Temporal generative adversarial nets. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

  22. Shou, Z., Wang, D., Chang, S.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  23. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning (ICML) (2015)

  24. Tulyakov, S., Liu, M., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  25. Vaswani, A., et al.: Attention is all you need. In: Proceedings of Neural Information Processing Systems (2017)

  26. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Proceedings of the Neural Information Processing Systems (2016)

  27. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2013)

  28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  29. Xiong, W., Luo, W., Ma, L., Liu, W., Luo, J.: Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  30. Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., Lin, D.: Pose guided human video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  31. Zhang, H., Goodfellow, I.J., Metaxas, D.N., Odena, A.: Self-attention generative adversarial networks. arXiv:1805.08318 (2018)

  32. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Learning to forecast and refine residual motion for image-to-video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  33. Zhou, Y., Berg, T.L.: Learning temporal transformations from time-lapse videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 262–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_16

  34. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 15H05915, 17H01745, 17H06100, and 19H04929.

Author information

Corresponding author

Correspondence to Keiji Yanai.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Horita, D., Yanai, K. (2020). SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science, vol 12046. Springer, Cham. https://doi.org/10.1007/978-3-030-41404-7_44

  • DOI: https://doi.org/10.1007/978-3-030-41404-7_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41403-0

  • Online ISBN: 978-3-030-41404-7

  • eBook Packages: Computer Science, Computer Science (R0)
