Photorealistic Video Generation with Diffusion Models

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present W.A.L.T, a diffusion transformer for photorealistic video generation from text prompts. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at \(512 \times 896\) resolution and 8 frames per second.
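To make the second design decision concrete, the sketch below illustrates, under our own assumptions rather than the paper's released code, how window-restricted self-attention can alternate between spatial windows (shared by image and video latents) and spatiotemporal windows (used only for video latents). The latent tensor shape, the window size, and all names (WindowAttention, spatial_windows, spatiotemporal_windows) are hypothetical.

```python
# Minimal illustrative sketch of alternating spatial / spatiotemporal window
# attention over video latents of shape (batch, frames, height, width, channels).
# This is NOT the authors' implementation; shapes and names are assumptions.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, tokens_per_window, dim); attention stays inside each window.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


def spatial_windows(z: torch.Tensor, ws: int) -> torch.Tensor:
    """Split each frame into ws x ws windows: tokens attend only within one frame."""
    b, t, h, w, c = z.shape
    z = z.view(b, t, h // ws, ws, w // ws, ws, c)
    # -> (b * t * num_windows, ws * ws, c)
    return z.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, ws * ws, c)


def spatiotemporal_windows(z: torch.Tensor, ws: int) -> torch.Tensor:
    """Windows spanning all frames at one spatial block: tokens attend across time."""
    b, t, h, w, c = z.shape
    z = z.view(b, t, h // ws, ws, w // ws, ws, c)
    # -> (b * num_windows, t * ws * ws, c)
    return z.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * ws * ws, c)


if __name__ == "__main__":
    # Toy latent: 2 videos, 5 latent frames, a 16x16 latent grid, 64 channels.
    z = torch.randn(2, 5, 16, 16, 64)
    attn = WindowAttention(dim=64)
    s = attn(spatial_windows(z, ws=8))           # path shared with still images
    st = attn(spatiotemporal_windows(z, ws=8))   # video-only path, skipped for images
    print(s.shape, st.shape)
```

The point of windowing in this sketch is that attention cost grows with the window size rather than with the full spatiotemporal token count, which is what makes joint training on images and videos in one latent space affordable; the still-image case simply skips the spatiotemporal path.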

A. Gupta—Work partially done during an internship at Google.

Acknowledgements

We thank Bryan Seybold, Dan Kondratyuk, David Ross, Hartwig Adam, Huisheng Wang, Jason Baldridge, Mauricio Delbracio and Orly Liba for helpful discussions and feedback.

Author information

Corresponding author

Correspondence to Agrim Gupta.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3452 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gupta, A. et al. (2025). Photorealistic Video Generation with Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15137. Springer, Cham. https://doi.org/10.1007/978-3-031-72986-7_23

  • DOI: https://doi.org/10.1007/978-3-031-72986-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72985-0

  • Online ISBN: 978-3-031-72986-7

  • eBook Packages: Computer Science, Computer Science (R0)
