DrivingDiffusion: Layout-Guided Multi-view Driving Scenarios Video Generation with Latent Diffusion Model

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

With the surge in autonomous driving technologies, the reliance on comprehensive and high-definition bird’s-eye-view (BEV) representations has become paramount, creating a demand for extensive, meticulously annotated multi-view video datasets to support research and development. However, acquiring such datasets is impeded by the prohibitive costs of data collection and annotation. Synthesizing multi-view videos from a 3D layout poses two challenges: 1) generation must handle both the view and temporal dimensions, so how can videos be generated while ensuring cross-view consistency and cross-frame consistency? 2) How can the precision of layout control and the quality of the generated instances be ensured? Addressing this critical bottleneck, we introduce DrivingDiffusion, a spatial-temporally consistent diffusion framework engineered to synthesize realistic multi-view videos governed by 3D spatial layouts. DrivingDiffusion navigates the dual challenges of maintaining cross-view and cross-frame coherence while meeting exacting standards of layout fidelity and visual quality. The framework operates in three stages: it first generates multi-view single-frame images, then synthesizes single-view videos for each camera, and finally applies a post-processing phase. We corroborate the efficacy of DrivingDiffusion through rigorous quantitative and qualitative evaluations, demonstrating its potential to significantly enhance autonomous driving tasks without incurring additional costs. Project page: https://drivingdiffusion.github.io.
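The three-stage pipeline described in the abstract lends itself to a simple procedural view. Below is a minimal Python sketch of that flow (multi-view keyframe generation, per-camera temporal rollout, then post-processing); all function names, tensor shapes, and placeholder bodies are illustrative assumptions, not the authors' actual models or API.

```python
from dataclasses import dataclass
import numpy as np

# Assumed surround-view setup: 6 cameras, 16 frames, small illustrative resolution.
N_CAMS, N_FRAMES, H, W = 6, 16, 128, 224

@dataclass
class Layout3D:
    """3D layout (boxes, road map) rendered as per-camera, per-frame conditioning maps."""
    maps: np.ndarray  # shape (N_FRAMES, N_CAMS, H, W)

def generate_multiview_keyframe(layout: Layout3D) -> np.ndarray:
    """Stage 1 (assumed): a cross-view consistent diffusion model samples the first
    frame of all cameras jointly, conditioned on the projected 3D layout."""
    return np.random.rand(N_CAMS, 3, H, W)          # placeholder for the sampled images

def generate_single_view_video(keyframe: np.ndarray, layout_seq: np.ndarray) -> np.ndarray:
    """Stage 2 (assumed): a temporal diffusion model extends one camera's keyframe
    into a video, conditioned on that camera's layout sequence."""
    return np.stack([keyframe] * N_FRAMES)          # placeholder rollout, (N_FRAMES, 3, H, W)

def post_process(videos: np.ndarray) -> np.ndarray:
    """Stage 3 (assumed): refine instance quality and smooth residual inconsistencies;
    identity here as a stand-in."""
    return videos

def driving_diffusion_pipeline(layout: Layout3D) -> np.ndarray:
    keyframes = generate_multiview_keyframe(layout)                   # (N_CAMS, 3, H, W)
    videos = np.stack(
        [generate_single_view_video(keyframes[c], layout.maps[:, c])  # per-camera rollout
         for c in range(N_CAMS)],
        axis=1,
    )                                                                 # (N_FRAMES, N_CAMS, 3, H, W)
    return post_process(videos)

if __name__ == "__main__":
    layout = Layout3D(maps=np.zeros((N_FRAMES, N_CAMS, H, W), dtype=np.float32))
    print(driving_diffusion_pipeline(layout).shape)  # (16, 6, 3, 128, 224)
```

The per-camera loop in stage 2 reflects the abstract's description: a shared multi-view keyframe fixes cross-view consistency first, after which each camera's video is rolled out independently and reconciled in post-processing.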

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, X., Zhang, Y., Ye, X. (2025). DrivingDiffusion: Layout-Guided Multi-view Driving Scenarios Video Generation with Latent Diffusion Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15136. Springer, Cham. https://doi.org/10.1007/978-3-031-73229-4_27

  • DOI: https://doi.org/10.1007/978-3-031-73229-4_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73228-7

  • Online ISBN: 978-3-031-73229-4

  • eBook Packages: Computer Science, Computer Science (R0)
