ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

By leveraging the text-to-image diffusion prior, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours on per-prompt online optimization, recent studies have focused on learning a text-to-3D generative network that amortizes optimization across many text-3D pairs and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale to a large number of text prompts, owing to the difficulty of aligning the pretrained diffusion prior with the distribution of images rendered from various text prompts. Current state-of-the-art methods such as Variational Score Distillation finetune the pretrained diffusion model to minimize the noise prediction error and thereby align the distributions, but such finetuning is unstable to train and impairs the model's comprehension of diverse text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and scales to 100k prompts. It reduces the noise prediction error without changing the weights of the pretrained diffusion model, thus preserving its strong comprehension of prompts. We conduct extensive experiments with different text-to-3D architectures, including Hyper-iNGP and 3DConv-Net. The results demonstrate ASD's effectiveness in stable 3D generator training and high-quality 3D content synthesis, as well as its superior prompt consistency, especially on large prompt corpora. Code is available at https://github.com/theEricMa/ScaleDreamer.
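
To make the method described above concrete, the following is a minimal PyTorch sketch of an ASD-style update, written from the abstract alone: noise is added to the rendering at timestep t, but the frozen diffusion model is queried at an earlier, shifted timestep where its noise prediction error tends to be lower. The names (asd_pseudo_grad, unet, alphas_cumprod) and the fixed shift delta_t are illustrative assumptions, not the authors' reference implementation; see the linked repository for the official code.

    import torch

    @torch.no_grad()
    def asd_pseudo_grad(unet, alphas_cumprod, x, text_emb, t, delta_t=30):
        # Forward diffusion: perturb the rendered latent x at timestep t.
        noise = torch.randn_like(x)
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * noise
        # Asynchronous step: query the *frozen* model at the earlier
        # timestep t - delta_t, where its noise prediction error is lower.
        t_earlier = (t - delta_t).clamp(min=0)
        eps_pred = unet(x_t, t_earlier, text_emb)
        # SDS-style weighting; the residual drives the generator update.
        w = 1.0 - a_t
        return w * (eps_pred - noise)

As in standard score distillation, the returned tensor would be applied as the gradient of the rendered latent, so only the 3D generator is updated while the diffusion weights stay fixed, consistent with the abstract's claim that ASD preserves the pretrained model's prompt comprehension.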

Acknowledgement

This work is supported in part by the Beijing Science and Technology Plan Project Z231100005923033, and the InnoHK program.

Electronic supplementary material

Supplementary material 1 (pdf, 3700 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, Z., Wei, Y., Zhang, Y., Zhu, X., Lei, Z., Zhang, L. (2025). ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_1

  • DOI: https://doi.org/10.1007/978-3-031-72667-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72666-8

  • Online ISBN: 978-3-031-72667-5

  • eBook Packages: Computer Science, Computer Science (R0)
