Abstract
We present W.A.L.T, a diffusion transformer for photorealistic video generation from text prompts. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at \(512 \times 896\) resolution and 8 frames per second.
A. Gupta—Work partially done during an internship at Google.
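As a rough illustration of the second design decision (window-restricted attention over video latents), below is a minimal PyTorch sketch of alternating spatial and spatiotemporal window-attention blocks. The class name, window sizes, dimensions, and layer choices are illustrative assumptions, not the W.A.L.T implementation.

```python
# A minimal sketch (not the authors' code) of window-restricted self-attention
# over video latents, alternating spatial and spatiotemporal windows as the
# abstract describes. All shapes, window sizes, and hyperparameters below are
# illustrative assumptions.
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to non-overlapping (wt, wh, ww) windows.

    A window with wt = 1 attends only within a single frame (spatial), so the
    same block also applies to images; wt > 1 gives a spatiotemporal window.
    """

    def __init__(self, dim: int, heads: int, window: tuple):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) latent video; assumes T, H, W divide the window.
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition into non-overlapping windows -> (num_windows, window_len, C).
        x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Pre-norm self-attention within each window, with a residual connection.
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Undo the window partition back to (B, T, H, W, C).
        x = x.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


# Alternate a spatial-window block with a spatiotemporal-window block.
blocks = nn.Sequential(
    WindowAttentionBlock(dim=64, heads=4, window=(1, 4, 4)),  # spatial only
    WindowAttentionBlock(dim=64, heads=4, window=(4, 4, 4)),  # spatiotemporal
)
latents = torch.randn(2, 4, 8, 8, 64)  # (B, T, H, W, C) toy latent video
print(blocks(latents).shape)           # torch.Size([2, 4, 8, 8, 64])
```

The point of the sketch is the cost structure the abstract alludes to: restricting attention to fixed-size windows keeps memory proportional to the window size rather than the full spatiotemporal extent, and the spatial-only blocks apply unchanged to images and individual video frames.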
Acknowledgements
We thank Bryan Seybold, Dan Kondratyuk, David Ross, Hartwig Adam, Huisheng Wang, Jason Baldridge, Mauricio Delbracio and Orly Liba for helpful discussions and feedback.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gupta, A. et al. (2025). Photorealistic Video Generation with Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15137. Springer, Cham. https://doi.org/10.1007/978-3-031-72986-7_23
DOI: https://doi.org/10.1007/978-3-031-72986-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72985-0
Online ISBN: 978-3-031-72986-7
eBook Packages: Computer Science, Computer Science (R0)