Abstract
Text-to-3D generation has recently seen significant progress. To enhance its practicality in real-world applications, it is crucial to generate multiple independent objects with interactions, similar to layer-compositing in 2D image editing. However, existing text-to-3D methods struggle with this task, as they are designed to generate either non-independent objects or independent objects lacking spatially plausible interactions. Addressing this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent objects with interactions. DreamDissector accepts a multi-object text-to-3D NeRF as input and produces independent textured meshes. To achieve this, we introduce the Neural Category Field (NeCF) for disentangling the input NeRF. Additionally, we present the Category Score Distillation Sampling (CSDS), facilitated by a Deep Concept Mining (DCM) module, to tackle the concept gap issue in diffusion models. By leveraging NeCF and CSDS, we can effectively derive sub-NeRFs from the original scene. Further refinement enhances geometry and texture. Our experimental results validate the effectiveness of DreamDissector, providing users with novel means to control 3D synthesis at the object level and potentially opening avenues for various creative applications in the future.
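To make the decomposition described above concrete, below is a minimal conceptual sketch (not the authors' released code) of how a neural category field could partition a frozen multi-object NeRF's density into per-object sub-fields, in the spirit of the NeCF described in the abstract. All names here (e.g., CategoryField, sub_nerf_density) and the network architecture are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: partitioning a frozen scene NeRF into K sub-NeRFs
# via a learned per-point category assignment. Not the authors' code.
import torch
import torch.nn as nn


class CategoryField(nn.Module):
    """Predicts a soft assignment of each 3D point to one of K object categories."""

    def __init__(self, num_categories: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_categories),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> per-point category probabilities (N, K)
        return torch.softmax(self.mlp(points), dim=-1)


def sub_nerf_density(base_density: torch.Tensor,
                     category_probs: torch.Tensor,
                     k: int) -> torch.Tensor:
    """Density of the k-th sub-NeRF: the frozen scene density re-weighted by
    the probability that each point belongs to category k."""
    return category_probs[:, k] * base_density
```

Under this sketch, each sub-field could be volume-rendered with the re-weighted density while reusing the original color field, and the resulting per-object renderings could then be supervised with per-category score distillation gradients, roughly mirroring the CSDS objective the abstract describes.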
Acknowledgement
This work was supported in part by NSFC under Grant No. 62293482, the Basic Research Project No. HZQB-KCZYZ-2021067 of the Hetao Shenzhen-HK S&T Cooperation Zone, the Guangdong Provincial Outstanding Youth Fund (No. 2023B1515020055), the National Key R&D Program of China under Grant No. 2018YFB1800800, the Shenzhen Outstanding Talents Training Fund 202002, Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, the Key Area R&D Program of Guangdong Province (Grant No. 2018B030338001), the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001), and the Shenzhen Key Laboratory of Big Data and Artificial Intelligence (Grant No. ZDSYS201707251409055). It was also partly supported by NSFC-61931024, NSFC-62172348, and Shenzhen Science and Technology Program No. JCYJ20220530143604010.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, Z. et al. (2025). DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15070. Springer, Cham. https://doi.org/10.1007/978-3-031-73254-6_8
DOI: https://doi.org/10.1007/978-3-031-73254-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73253-9
Online ISBN: 978-3-031-73254-6
eBook Packages: Computer Science (R0)