Abstract
With the surge in autonomous driving technologies, comprehensive and high-definition bird's-eye-view (BEV) representations have become paramount, driving demand for extensive, meticulously annotated multi-view video datasets. However, acquiring such datasets is impeded by the prohibitive costs of data collection and annotation. Synthesizing multi-view videos from a 3D layout poses two challenges: 1) generating videos that span both the view and temporal dimensions while maintaining cross-view and cross-frame consistency; and 2) ensuring precise layout control and high quality of the generated instances. To address this bottleneck, we introduce DrivingDiffusion, a spatial-temporally consistent diffusion framework that synthesizes realistic multi-view videos governed by 3D spatial layouts. The framework operates in three stages: generating multi-view single-frame images, synthesizing single-view videos for each camera, and applying a final post-processing phase. We corroborate the efficacy of DrivingDiffusion through rigorous quantitative and qualitative evaluations, demonstrating its potential to enhance autonomous driving tasks without incurring additional data-collection costs. Project page: https://drivingdiffusion.github.io.
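The abstract summarizes the pipeline only at a high level. As a reading aid, the sketch below illustrates the three-stage control flow it describes (multi-view keyframe generation, per-camera temporal rollout, post-processing); all function names, tensor shapes, and placeholder bodies are our own assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the three-stage pipeline described in the abstract.
# All names, shapes, and placeholder bodies are assumptions, NOT the authors' code.
import numpy as np

N_VIEWS, N_FRAMES, H, W, C = 6, 8, 256, 448, 3  # assumed camera count / clip length


def generate_multiview_keyframe(layout_3d: np.ndarray) -> np.ndarray:
    """Stage 1 (assumed): a layout-conditioned multi-view image diffusion model
    produces one mutually consistent frame per camera. Placeholder output."""
    return np.zeros((N_VIEWS, H, W, C), dtype=np.float32)


def generate_single_view_video(keyframe: np.ndarray, layout_3d: np.ndarray) -> np.ndarray:
    """Stage 2 (assumed): a temporal diffusion model extends one camera's keyframe
    into a short clip that follows the layout over time. Placeholder output."""
    return np.repeat(keyframe[None], N_FRAMES, axis=0)


def post_process(videos: np.ndarray) -> np.ndarray:
    """Stage 3 (assumed): post-processing to harmonize appearance across views and
    frames; here an identity placeholder."""
    return videos


def driving_diffusion_pipeline(layout_3d: np.ndarray) -> np.ndarray:
    keyframes = generate_multiview_keyframe(layout_3d)                 # (views, H, W, C)
    videos = np.stack(
        [generate_single_view_video(kf, layout_3d) for kf in keyframes]
    )                                                                  # (views, frames, H, W, C)
    return post_process(videos)


if __name__ == "__main__":
    dummy_layout = np.zeros((32, 8), dtype=np.float32)  # assumed layout encoding
    print(driving_diffusion_pipeline(dummy_layout).shape)  # (6, 8, 256, 448, 3)
```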
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X., Zhang, Y., Ye, X. (2025). DrivingDiffusion: Layout-Guided Multi-view Driving Scenarios Video Generation with Latent Diffusion Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15136. Springer, Cham. https://doi.org/10.1007/978-3-031-73229-4_27
DOI: https://doi.org/10.1007/978-3-031-73229-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73228-7
Online ISBN: 978-3-031-73229-4
eBook Packages: Computer Science (R0)