SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12046)

Abstract

In our daily lives, we routinely predict how the objects around us will move in the near future. But how can such predictions be made computationally? In this paper, we address this problem with a GAN-based network that predicts the near future for fluid object domains such as cloud and beach scenes. Our model takes a single frame as input and predicts the subsequent frames. Inspired by the self-attention mechanism [25], we introduce a spatial self-attention mechanism into the model. This mechanism computes the response at each position as a weighted sum of the features at all positions, which allows the model to be trained efficiently in a single stage. In our experiments, we show that our model is comparable to the state-of-the-art method, which requires two-stage learning.
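As a concrete illustration of the spatial self-attention described above, the following is a minimal sketch of such a layer in PyTorch, in the spirit of self-attention GANs [31] and non-local networks [28]. The module name, the channel reduction factor, and its placement in the generator are illustrative assumptions; this is not the authors' exact architecture.

```python
# Minimal sketch of a spatial self-attention layer (SAGAN-style [31]).
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSelfAttention(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        reduced = max(in_channels // 8, 1)
        # 1x1 convolutions produce query/key/value maps from the input features.
        self.query = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # Learned scale so the block initially behaves as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)   # (b, n, c')
        k = self.key(x).view(b, -1, n)                       # (b, c', n)
        v = self.value(x).view(b, c, n)                      # (b, c, n)
        # Attention weights: similarity of every position with every other position.
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (b, n, n)
        # Response at each position = weighted sum of the features at all positions.
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x
```

In a generator, such a block would typically sit between convolutional layers on mid-resolution feature maps, where the attention map over all n = h x w positions is still affordable.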

Notes

  1. https://sites.google.com/site/whluoimperial/mdgan.

  2. http://www.cs.columbia.edu/~vondrick/tinyvideo/.

References

  1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML) (2017)

  2. Cai, H., Bai, C., Tai, Y., Tang, C.: Deep video generation, prediction and completion of human action sequences. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  3. Cheng, X., Dale, C., Liu, J.: Understanding the characteristics of internet short video sharing: YouTube as a case study. In: IEEE International Symposium on Multimedia (ISM) (2007)

  4. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49

  5. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: Proceedings of Neural Information Processing Systems (2016)

  6. Goodfellow, I., et al.: Generative adversarial nets. In: Proceedings of Neural Information Processing Systems (2014)

  7. Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  8. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Proceedings of the Neural Information Processing Systems (2018)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016)

  10. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML) (2015)

  11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2017)

  12. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013)

  13. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: Proceedings of the International Conference on Learning Representation (ICLR) (2017)

  14. Krishna, R., Hata, K., Ren, F., Li, F., Niebles, J.C.: Dense-captioning events in videos. In: Proceedings of IEEE International Conference on Computer Vision (ICCV) (2017)

  15. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.: Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of European Conference on Computer Vision (ECCV) (2018)

  16. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)

  17. Ohnishi, K., Yamamoto, S., Ushiku, Y., Harada, T.: Hierarchical video generation from orthogonal information: optical flow and texture. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2018)

  18. Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  19. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  20. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the International Conference on Learning Representation (ICLR) (2016)

  21. Saito, M., Matsumoto, E.: Temporal generative adversarial nets. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

  22. Shou, Z., Wang, D., Chang, S.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2016)

  23. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning (ICML) (2015)

  24. Tulyakov, S., Liu, M., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  25. Vaswani, A., et al.: Attention is all you need. In: Proceedings of Neural Information Processing Systems (2017)

  26. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Proceedings of the Neural Information Processing Systems (2016)

  27. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2013)

  28. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  29. Xiong, W., Luo, W., Ma, L., Liu, W., Luo, J.: Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) (2018)

  30. Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., Lin, D.: Pose guided human video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  31. Zhang, H., Goodfellow, I.J., Metaxas, D.N., Odena, A.: Self-attention generative adversarial networks. arXiv:1805.08318 (2018)

  32. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Learning to forecast and refine residual motion for image-to-video generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  33. Zhou, Y., Berg, T.L.: Learning temporal transformations from time-lapse videos. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 262–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_16

  34. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 15H05915, 17H01745, 17H06100, and 19H04929.

Author information

Corresponding author

Correspondence to Keiji Yanai.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Horita, D., Yanai, K. (2020). SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science, vol 12046. Springer, Cham. https://doi.org/10.1007/978-3-030-41404-7_44

  • DOI: https://doi.org/10.1007/978-3-030-41404-7_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41403-0

  • Online ISBN: 978-3-030-41404-7

  • eBook Packages: Computer Science, Computer Science (R0)
