Abstract
By leveraging a text-to-image diffusion prior, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have focused on learning a text-to-3D generative network that amortizes multiple text-3D relations and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale up to a large number of text prompts, due to the difficulty of aligning the pretrained diffusion prior with the distribution of rendered images from various text prompts. Current state-of-the-art methods such as Variational Score Distillation (VSD) fine-tune the pretrained diffusion model to minimize the noise prediction error and thereby align the distributions, but they are unstable to train and impair the model's comprehension of numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of the pretrained diffusion model, thus preserving the model's strong comprehension of prompts. We conduct extensive experiments with different text-to-3D architectures, including Hyper-iNGP and 3DConv-Net. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and superior prompt consistency, especially on large prompt corpora. Code is available at https://github.com/theEricMa/ScaleDreamer.
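To make the timestep-shifting idea concrete, below is a minimal PyTorch sketch of an ASD-style gradient step. It is an illustration under stated assumptions, not the authors' implementation: it assumes diffusers-style `scheduler.add_noise` and text-conditioned UNet call conventions, and the helper names (`asd_gradient`, `unet`, `scheduler`, `text_emb`) and shift size `delta_t` are hypothetical. The gradient form, the difference between the frozen model's predictions at timestep t and at the earlier timestep t + Δt in place of VSD's fine-tuned second network, follows the abstract's description.

```python
# Minimal ASD-style sketch (illustrative; not the authors' exact code).
# Assumes a diffusers-style noise scheduler and text-conditioned UNet.
import torch

def asd_gradient(unet, scheduler, x0, text_emb, delta_t=-50, w_t=1.0):
    """Score-distillation signal for one batch of rendered images `x0`.

    Instead of fine-tuning a second diffusion network (as in VSD), the
    frozen UNet is queried twice: once at a random timestep t and once at
    the earlier timestep t + delta_t, where its noise prediction error is
    lower. The difference of the two predictions drives the 3D generator.
    """
    T = scheduler.config.num_train_timesteps                # e.g. 1000
    # Sample t so that the shifted timestep t + delta_t stays >= 0.
    t = torch.randint(-delta_t, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)

    x_t = scheduler.add_noise(x0, noise, t)                 # noisy render at t
    x_s = scheduler.add_noise(x0, noise, t + delta_t)       # same noise, earlier step

    with torch.no_grad():                                   # diffusion prior stays frozen
        eps_t = unet(x_t, t, encoder_hidden_states=text_emb).sample
        eps_s = unet(x_s, t + delta_t, encoder_hidden_states=text_emb).sample

    # Applied to the generator parameters theta as in SDS/VSD, i.e.
    # grad_theta ~ w(t) * (eps_t - eps_s) * d(x0)/d(theta).
    return w_t * (eps_t - eps_s)
```

In training, this signal would be injected exactly as an SDS/VSD gradient, e.g. via `loss = (asd_gradient(...).detach() * x0).sum()` followed by `loss.backward()`, so that only the 3D generator's weights are updated while the diffusion prior, and hence its prompt comprehension, remains untouched.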
Acknowledgement
This work is supported in part by Beijing Science and Technology Plan Project Z231100005923033 and the InnoHK program.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, Z., Wei, Y., Zhang, Y., Zhu, X., Lei, Z., Zhang, L. (2025). ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_1
DOI: https://doi.org/10.1007/978-3-031-72667-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science; Computer Science (R0)