
High-Quality Video Generation from Static Structural Annotations

  • Published in: International Journal of Computer Vision

Abstract

This paper proposes a novel unsupervised video generation approach that is conditioned on a single structural annotation map and, in contrast to prior conditional video generation approaches, provides a good balance between motion flexibility and visual quality in the generation process. Rather than modeling the scene appearance and dynamics in a single shot as end-to-end approaches do, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, thus achieving remarkable results overall. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second, an image-to-video (I2V) generation task, uses the synthesized starting frame and the associated structural annotation map to animate the scene dynamics and generate a photorealistic and temporally coherent video. We employ a cycle-consistent, flow-based conditional variational autoencoder to capture the long-term motion distributions, in which the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves the structural awareness of the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods. The code has been released at https://github.com/junting/seg2vid.
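
To make the divide-and-conquer structure concrete, the following is a minimal PyTorch sketch of the two-stage pipeline described above. The ToyI2I and ToyI2V modules are hypothetical stand-ins for illustration only, not the released seg2vid networks; the flow-based conditional variational autoencoder and the occlusion handling are deliberately omitted.

```python
# Minimal sketch of the two-stage decomposition (I2I, then I2V).
# ToyI2I and ToyI2V are illustrative placeholders, not the seg2vid models.
import torch
import torch.nn as nn


class ToyI2I(nn.Module):
    """Stage 1: structural annotation map -> starting frame (placeholder)."""

    def __init__(self, num_classes=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, seg):
        return self.net(seg)


class ToyI2V(nn.Module):
    """Stage 2: starting frame + annotation map -> short video (placeholder)."""

    def __init__(self, num_classes=19, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * num_frames, 3, padding=1), nn.Tanh())

    def forward(self, first_frame, seg):
        b, _, h, w = first_frame.shape
        x = torch.cat([first_frame, seg], dim=1)
        return self.net(x).view(b, self.num_frames, 3, h, w)


if __name__ == "__main__":
    seg = torch.randn(1, 19, 128, 256)     # stand-in one-hot annotation map
    frame0 = ToyI2I()(seg)                 # stage 1: I2I translation
    video = ToyI2V()(frame0, seg)          # stage 2: I2V generation
    print(video.shape)                     # torch.Size([1, 8, 3, 128, 256])
```

The point of the sketch is only the interface: the I2I stage maps an annotation map to a starting frame, and the I2V stage consumes that frame together with the same annotation map to produce the video tensor.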


Notes

  1. Or other kinds of structural annotation maps, such as sketches or human skeletons. Intuitively, a reference image works as well.

  2. Note that an early version of this work was published in Pan et al. (2019). Compared with that version, this paper makes substantial extensions, including a new probabilistic formulation of the whole framework, a more comprehensive bi-directional flow-based video generation module with occlusion-aware image synthesis, a new ablation study section, and additional experiments on human action datasets.

  3. The forward flow \(\mathbf{W}_t^f\) is warped according to the backward flow \(\mathbf{W}_t^b\) via the backward warping operation (2); a generic sketch of backward warping is given after these notes.

  4. The UCF-101 and KTH Action datasets are first divided into per-class sub-datasets.

  5. Our results are rescaled to match the resolutions of those by Wang et al. (2018a).
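
In generic form, the backward warping referenced in Note 3 is a flow-guided bilinear resampling. The sketch below, written with PyTorch's F.grid_sample, illustrates that generic operation under the assumption that flows are expressed in pixel units; it does not reproduce the paper's equation (2), and the backward_warp helper is a hypothetical name.

```python
# Generic backward warping via flow-guided bilinear sampling (cf. Note 3).
# backward_warp is a hypothetical helper name; flows are in pixel units.
import torch
import torch.nn.functional as F


def backward_warp(src, flow):
    """Warp src (B, C, H, W) with a backward flow (B, 2, H, W).

    Each output pixel (x, y) samples src at (x + flow_x, y + flow_y),
    i.e. the flow points from the target frame back into src.
    """
    b, _, h, w = src.shape
    # Base sampling grid in pixel coordinates (requires PyTorch >= 1.10).
    ys, xs = torch.meshgrid(torch.arange(h, dtype=src.dtype),
                            torch.arange(w, dtype=src.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).to(src.device)  # (1,2,H,W)
    coords = grid + flow
    # Normalize coordinates to [-1, 1], as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(src, torch.stack((coords_x, coords_y), dim=-1),
                         align_corners=True)


# Example: warping a forward flow field with the backward flow, as in Note 3.
fwd = torch.randn(1, 2, 64, 64)   # forward flow W_t^f
bwd = torch.randn(1, 2, 64, 64)   # backward flow W_t^b
fwd_warped = backward_warp(fwd, bwd)
```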

References

  • Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.

  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S. (2017). Stochastic variational video prediction. ICLR

  • Balakrishnan, G., Zhao, A., Dalca, A.V., Durand, F., Guttag, J. (2018). Synthesizing images of humans in unseen poses. In: CVPR, IEEE.

  • Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR, IEEE.

  • Carreira, J., Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR, IEEE.

  • Chen, B., Wang, W., Wang, J. (2017). Video imagination from a single image with transformation generation. In: ACM MM, ACM, pp 358–366.

  • Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In: CVPR, IEEE.

  • Denton, E., Fergus, R. (2018). Stochastic video generation with a learned prior. ICML

  • Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In: ICCV, IEEE, pp 2758–2766.

  • Finn, C., Goodfellow, I., Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In: NIPS, pp 64–72.

  • Ganin, Y., Kononenko, D., Sungatullina, D., Lempitsky, V. (2016). Deepwarp: Photorealistic image resynthesis for gaze manipulation. In: ECCV, Springer, pp 311–326.

  • Geiger, A., Lenz, P., Stiller, C., Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. IJRR.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). Generative adversarial nets. In: NIPS, pp 2672–2680.

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S. (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NIPS, pp 6626–6637

  • Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. CVPR.

  • Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In: NIPS, pp 2017–2025.

  • Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J. (2018). Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In: CVPR, IEEE.

  • Johnson, J., Alahi, A., Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In: ECCV, Springer, pp 694–711.

  • Johnson, J., Gupta, A., Fei-Fei, L. (2018). Image generation from scene graphs. In: CVPR, IEEE.

  • Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K. (2016). Video pixel networks. arXiv preprint arXiv:1610.00527.

  • Kingma, D.P., Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  • Laptev, I., Caputo, B., et al. (2004). Recognizing human actions: a local SVM approach. In: ICPR, IEEE, pp 32–36.

  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H. (2018). Flow-grounded spatial-temporal video prediction from still images. In: ECCV, Springer.

  • Liang, X., Lee, L., Dai, W., Xing, E.P. (2017). Dual motion GAN for future-flow embedded video prediction. In: ICCV, IEEE.

  • Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B. (2018). Image inpainting for irregular holes using partial convolutions. In: ECCV, Springer.

  • Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In: ICCV, IEEE.

  • Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L. (2017). Unsupervised learning of long-term motion dynamics for videos. arXiv preprint arXiv:1701.01821.

  • Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L. (2017). Pose guided person image generation. In: NIPS, pp 406–416.

  • Mathieu, M., Couprie, C., LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.

  • Meister, S., Hur, J., Roth, S. (2018). UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: AAAI, New Orleans, Louisiana.

  • van den Oord, A., Kalchbrenner, N., Kavukcuoglu, K. (2016). Pixel recurrent neural networks. ICML.

  • Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., Wang, X. (2019). Video generation from single semantic label map. In: CVPR, IEEE.

  • Patraucean, V., Handa, A., Cipolla, R. (2015). Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309.

  • Pintea, S.L., van Gemert, J.C., Smeulders, A.W.M. (2014). Dejavu: Motion prediction in static images. In: ECCV, Springer.

  • Radford, A., Metz, L., Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

  • Saito, M., Matsumoto, E., Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In: ICCV, IEEE.

  • Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R. (2017). Learning from simulated and unsupervised images through adversarial training. In: CVPR, IEEE.

  • Sohn, K., Lee, H., Yan, X. (2015). Learning structured output representation using deep conditional generative models. In: NIPS, pp 3483–3491.

  • Soomro, K., Zamir, A.R., Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402

  • Srivastava, N., Mansimov, E., Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In: ICML, pp 843–852.

  • Sun, D., Yang, X., Liu, M.Y., Kautz, J. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR, IEEE.

  • Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In: CVPR, IEEE.

  • Uria, B., Côté, M. A., Gregor, K., Murray, I., & Larochelle, H. (2016). Neural autoregressive distribution estimation. JMLR, 17(1), 7184–7220.

  • Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H. (2017a). Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033

  • Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H. (2017b). Learning to generate long-term future via hierarchical prediction. In: ICML.

  • Vondrick, C., Torralba, A. (2017). Generating the future with adversarial transformers. In: CVPR, IEEE.

  • Vondrick, C., Pirsiavash, H., Torralba, A. (2016a). Anticipating visual representations from unlabeled video. In: CVPR, IEEE, pp 98–106.

  • Vondrick, C., Pirsiavash, H., Torralba, A. (2016b). Generating videos with scene dynamics. In: NIPS, pp 613–621.

  • Walker, J., Doersch, C., Gupta, A., Hebert, M. (2016). An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV, Springer, pp 835–851

  • Walker, J., Gupta, A., Hebert, M. (2014). Patch to the future: Unsupervised visual prediction. In: CVPR, IEEE, pp 3302–3309.

  • Walker, J., Gupta, A., Hebert, M. (2015). Dense optical flow prediction from a static image. In: ICCV, IEEE, pp 2443–2451.

  • Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B. (2018a). Video-to-video synthesis. In: NeurIPS.

  • Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B. (2018b). High-resolution image synthesis and semantic manipulation with conditional GANs. In: CVPR, IEEE.

  • Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. TIP, 13(4), 600–612.

  • Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T. (2017). Video enhancement with task-oriented flow. arXiv preprint arXiv:1711.09078.

  • Xue, T., Wu, J., Bouman, K., Freeman, B. (2016). Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: NIPS, pp 91–99.

  • Yin, Z., Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In: CVPR, IEEE.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV, pp 5907–5915.

  • Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In: ECCV, Springer.

  • Zheng, Z., Zheng, L., Yang, Y. (2017). Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In: ICCV, pp 3754–3762.

  • Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A. (2016). View synthesis by appearance flow. In: ECCV, Springer, pp 286–301.

  • Zhu, J.Y., Park, T., Isola, P., Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, IEEE.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61906012, and in part by Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, and NTU NAP.

Author information

Correspondence to Lu Sheng.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Lu Sheng and Junting Pan contributed equally.

About this article

Cite this article

Sheng, L., Pan, J., Guo, J. et al. High-Quality Video Generation from Static Structural Annotations. Int J Comput Vis 128, 2552–2569 (2020). https://doi.org/10.1007/s11263-020-01334-x
