STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians

  • Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15094)

Abstract

Recent progress in pre-trained diffusion models and 3D generation has spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we use a multi-view diffusion model to initialize multi-view images anchored on the input video frames, where the video can be either captured in the real world or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy that leverages the first frame as a temporal anchor in the self-attention computation. Given these nearly consistent multi-view sequences, we then apply score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is specially crafted for the generation task, with an adaptive densification strategy proposed to mitigate unstable Gaussian gradients for robust optimization. Notably, the proposed pipeline requires no pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state of the art for 4D generation from diverse inputs, including text, image, and video.
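The first-frame anchoring described above amounts to extending each frame's self-attention with keys and values taken from the anchor frame. Below is a minimal PyTorch sketch of that fusion, assuming tensors of shape (batch, heads, tokens, dim); the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def anchored_self_attention(q, k, v, k_anchor, v_anchor):
    """Self-attention in which every frame also attends to the first
    (anchor) frame's keys/values, promoting temporal consistency.

    q, k, v:            (batch, heads, tokens, dim) for the current frame
    k_anchor, v_anchor: (batch, heads, tokens, dim) from the first frame
    """
    # Concatenating anchor keys/values lets the denoiser "look back" at
    # the first frame while generating every subsequent frame.
    k_fused = torch.cat([k, k_anchor], dim=2)
    v_fused = torch.cat([v, v_anchor], dim=2)
    return F.scaled_dot_product_attention(q, k_fused, v_fused)
```

Because the anchor only enlarges the key/value set of an existing attention layer, this kind of fusion needs no retraining of the diffusion network, consistent with the training-free claim above.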
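The optimization stage couples score distillation sampling with densification driven by averaged positional gradients. The sketch below shows the general shape of such a loop under stated assumptions: `render_fn`, `encode_fn`, the `diffusion` wrapper, and the threshold value are hypothetical stand-ins, not the paper's released code.

```python
import torch

def sds_step(render_fn, encode_fn, diffusion, params, optimizer,
             guidance_scale=50.0):
    """One score distillation sampling (SDS) update of the 4D Gaussian
    parameters against a frozen diffusion model (hypothetical API)."""
    optimizer.zero_grad()
    latents = encode_fn(render_fn(params))          # differentiable render -> latent
    t = torch.randint(20, 980, (1,))                # random diffusion timestep
    noise = torch.randn_like(latents)
    noisy = diffusion.add_noise(latents, noise, t)  # forward diffusion
    with torch.no_grad():
        eps = diffusion.predict_noise(noisy, t, guidance_scale)
    # SDS treats (eps - noise) as the gradient w.r.t. the latents and
    # backpropagates it into the Gaussians, skipping the U-Net Jacobian.
    latents.backward(gradient=(eps - noise))
    optimizer.step()

def densify_mask(pos_grad_accum, counts, threshold=2e-4):
    """Select Gaussians to clone/split by their *averaged* positional
    gradient, damping noisy per-step SDS gradients (illustrative only)."""
    avg = pos_grad_accum.norm(dim=-1) / counts.clamp(min=1)
    return avg > threshold                          # boolean densification mask
```

Averaging the accumulated gradients before thresholding is one plausible reading of the adaptive densification idea: it keeps a single noisy SDS step from triggering spurious splits.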

Acknowledgements

This work was supported by the National Key R&D Program of China (2022YFF0902200) and NSFC grant 62441204.

Author information

Corresponding author

Correspondence to Yao Yao.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2796 KB)

Supplementary material 2 (mp4 47590 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zeng, Y. et al. (2025). STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_10

  • DOI: https://doi.org/10.1007/978-3-031-72764-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72763-4

  • Online ISBN: 978-3-031-72764-1

  • eBook Packages: Computer Science, Computer Science (R0)
