
VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper presents Invariant Score Distillation (ISD), a novel method for high-fidelity text-to-3D generation that tackles the over-saturation and over-smoothing problems in Score Distillation Sampling (SDS). We decouple SDS into a weighted sum of two components: a reconstruction term and a classifier-free guidance term. Experimentally, we find that over-saturation stems from the large classifier-free guidance scale, while over-smoothing comes from the reconstruction term. To overcome these problems, ISD replaces the reconstruction term in SDS with an invariant score term derived from DDIM sampling. This substitution permits a medium classifier-free guidance scale and mitigates reconstruction-related errors, preventing both over-smoothing and over-saturation in the results. Extensive experiments demonstrate that our method greatly enhances SDS and produces realistic 3D objects through single-stage optimization.

Our code is available at https://github.com/SupstarZh/VividDreamer.
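
The two-term decomposition the abstract refers to can be made concrete. The sketch below uses standard DreamFusion-style SDS notation, which this page does not define and which may differ from the paper's own: ε_φ is the pretrained diffusion model's noise prediction, y the text prompt, ω the classifier-free guidance scale, w(t) a timestep weighting, and x_t the noised rendering of the 3D parameters θ.

```latex
% Hedged sketch of the SDS gradient and its two-term split, reconstructed
% from the abstract alone (notation assumed; the paper's exact form may differ).
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
      \bigl(\hat{\epsilon}_\phi^{\,\omega}(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta} \right],
\qquad
\hat{\epsilon}_\phi^{\,\omega}
  = \epsilon_\phi(x_t; y, t)
  + \omega \bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; t)\bigr).
\]
% Regrouping the residual exposes the weighted sum of two components:
\[
\hat{\epsilon}_\phi^{\,\omega} - \epsilon
  = \underbrace{\bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)}_{\text{reconstruction term}}
  + \omega\,\underbrace{\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; t)\bigr)}_{\text{classifier-free guidance term}}.
\]
```

On this reading, a large ω inflates the guidance component (over-saturation), while the random noise ε inside the reconstruction component injects error into every update (over-smoothing); per the abstract, ISD keeps a medium ω and swaps the reconstruction component for an invariant score term derived from DDIM sampling.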



Acknowledgments

This work is supported by the Major Program of the National Natural Science Foundation of China (T2293720/T2293723), the National Natural Science Foundation of China (U2336212 and 62293554), the Fundamental Research Funds for the Central Universities (No. 226-2024-00058), the Fundamental Research Funds for the Zhejiang Provincial Universities (226-2024-00208), the “Leading Goose” R&D Program of Zhejiang Province under Grant 2024C01101, and the China Postdoctoral Science Foundation (524000-X92302).

Author information

Corresponding author

Correspondence to Yi Yang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6492 KB)

Supplementary material 2 (mp4 7693 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhuo, W., Ma, F., Fan, H., Yang, Y. (2025). VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15146. Springer, Cham. https://doi.org/10.1007/978-3-031-73223-2_8

  • DOI: https://doi.org/10.1007/978-3-031-73223-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73222-5

  • Online ISBN: 978-3-031-73223-2

  • eBook Packages: Computer Science, Computer Science (R0)
