Abstract
Text-to-3D generation has recently seen significant progress. To enhance its practicality in real-world applications, it is crucial to generate multiple independent objects with interactions, similar to layer-compositing in 2D image editing. However, existing text-to-3D methods struggle with this task, as they are designed to generate either non-independent objects or independent objects lacking spatially plausible interactions. Addressing this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent objects with interactions. DreamDissector accepts a multi-object text-to-3D NeRF as input and produces independent textured meshes. To achieve this, we introduce the Neural Category Field (NeCF) for disentangling the input NeRF. Additionally, we present the Category Score Distillation Sampling (CSDS), facilitated by a Deep Concept Mining (DCM) module, to tackle the concept gap issue in diffusion models. By leveraging NeCF and CSDS, we can effectively derive sub-NeRFs from the original scene. Further refinement enhances geometry and texture. Our experimental results validate the effectiveness of DreamDissector, providing users with novel means to control 3D synthesis at the object level and potentially opening avenues for various creative applications in the future.
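To make the decomposition described above concrete, below is a minimal conceptual sketch (not the authors' released code) of how a neural category field could partition a frozen multi-object NeRF's density into per-object sub-fields, in the spirit of the NeCF described in the abstract. All names here (e.g., CategoryField, sub_nerf_density) and the network architecture are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: partitioning a frozen scene NeRF into K sub-NeRFs
# via a learned per-point category assignment. Not the authors' code.
import torch
import torch.nn as nn


class CategoryField(nn.Module):
    """Predicts a soft assignment of each 3D point to one of K object categories."""

    def __init__(self, num_categories: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_categories),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> per-point category probabilities (N, K)
        return torch.softmax(self.mlp(points), dim=-1)


def sub_nerf_density(base_density: torch.Tensor,
                     category_probs: torch.Tensor,
                     k: int) -> torch.Tensor:
    """Density of the k-th sub-NeRF: the frozen scene density re-weighted by
    the probability that each point belongs to category k."""
    return category_probs[:, k] * base_density
```

Under this sketch, each sub-field could be volume-rendered with the re-weighted density while reusing the original color field, and the resulting per-object renderings could then be supervised with per-category score distillation gradients, roughly mirroring the CSDS objective the abstract describes.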
Acknowledgement
This work was supported in part by NSFC under Grant No. 62293482, the Basic Research Project No. HZQB-KCZYZ-2021067 of the Hetao Shenzhen-HK S&T Cooperation Zone, the Guangdong Provincial Outstanding Youth Fund (No. 2023B1515020055), the National Key R&D Program of China under Grant No. 2018YFB1800800, the Shenzhen Outstanding Talents Training Fund 202002, Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, the Key Area R&D Program of Guangdong Province (Grant No. 2018B030338001), the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001), and the Shenzhen Key Laboratory of Big Data and Artificial Intelligence (Grant No. ZDSYS201707251409055). It was also partly supported by NSFC-61931024, NSFC-62172348, and Shenzhen Science and Technology Program No. JCYJ20220530143604010.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, Z. et al. (2025). DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15070. Springer, Cham. https://doi.org/10.1007/978-3-031-73254-6_8
DOI: https://doi.org/10.1007/978-3-031-73254-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73253-9
Online ISBN: 978-3-031-73254-6
eBook Packages: Computer Science (R0)