
DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Text-to-3D generation has recently seen significant progress. To make it more practical for real-world applications, it is crucial to generate multiple independent objects with plausible interactions, analogous to layer compositing in 2D image editing. However, existing text-to-3D methods struggle with this task, as they are designed to generate either entangled, non-independent objects or independent objects that lack spatially plausible interactions. To address this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent, interacting objects. DreamDissector accepts a multi-object text-to-3D NeRF as input and produces independent textured meshes. To achieve this, we introduce the Neural Category Field (NeCF) for disentangling the input NeRF. In addition, we present Category Score Distillation Sampling (CSDS), facilitated by a Deep Concept Mining (DCM) module, to tackle the concept-gap issue in diffusion models. By leveraging NeCF and CSDS, we can effectively derive sub-NeRFs from the original scene, and a subsequent refinement stage further enhances geometry and texture. Our experimental results validate the effectiveness of DreamDissector, providing users with a novel means of controlling 3D synthesis at the object level and potentially opening avenues for various creative applications.
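To make the pipeline described above concrete, the following is a minimal, illustrative sketch of the NeCF idea only: a small field predicts per-point category probabilities and splits the frozen source NeRF's density into per-category densities that sum back to the original scene. The network sizes, encoding, and function names below are assumptions for illustration, not the authors' implementation; the full method additionally volume-renders each sub-field and supervises it with category-specific score distillation (CSDS) guided by prompts from deep concept mining, which is omitted here.

```python
# Minimal sketch of the Neural Category Field (NeCF) idea from the abstract.
# Hypothetical sizes and helpers; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def positional_encoding(x: torch.Tensor, n_freqs: int = 6) -> torch.Tensor:
    """Standard NeRF-style sinusoidal encoding of 3D points."""
    freqs = 2.0 ** torch.arange(n_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                      # (..., 3, n_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                             # (..., 3 * 2 * n_freqs)


class NeuralCategoryField(nn.Module):
    """Maps a 3D point to a probability distribution over K object categories."""

    def __init__(self, n_categories: int, n_freqs: int = 6, hidden: int = 128):
        super().__init__()
        in_dim = 3 * 2 * n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_categories),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(positional_encoding(points))
        return F.softmax(logits, dim=-1)               # (..., K), sums to 1 per point


def split_density(sigma: torch.Tensor, category_probs: torch.Tensor) -> torch.Tensor:
    """Split the frozen source NeRF's density among K sub-NeRFs.

    sigma: (..., 1) densities from the pretrained multi-object NeRF.
    category_probs: (..., K) output of the NeCF.
    The per-category densities sum back to sigma, so rendering category k alone
    isolates one object while the composite scene is preserved.
    """
    return sigma * category_probs                      # (..., K)


if __name__ == "__main__":
    necf = NeuralCategoryField(n_categories=2)
    pts = torch.rand(4096, 3) * 2 - 1                  # sample points in [-1, 1]^3
    sigma = torch.rand(4096, 1)                        # stand-in for the frozen NeRF's density
    per_cat_sigma = split_density(sigma, necf(pts))
    assert torch.allclose(per_cat_sigma.sum(-1, keepdim=True), sigma, atol=1e-5)
    # In the full method, each per-category density would be volume-rendered and
    # supervised with a category-specific score distillation loss (CSDS) whose
    # prompts come from deep concept mining; that part is omitted in this sketch.
```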


Notes

  1. https://github.com/threestudio-project/threestudio.


Acknowledgement

This work was supported in part by NSFC Grant No. 62293482, Basic Research Project No. HZQB-KCZYZ-2021067 of the Hetao Shenzhen-HK S&T Cooperation Zone, the Guangdong Provincial Outstanding Youth Fund (No. 2023B1515020055), the National Key R&D Program of China (Grant No. 2018YFB1800800), the Shenzhen Outstanding Talents Training Fund 202002, Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, the Key Area R&D Program of Guangdong Province (Grant No. 2018B030338001), the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001), and the Shenzhen Key Laboratory of Big Data and Artificial Intelligence (Grant No. ZDSYS201707251409055). It was also partly supported by NSFC-61931024, NSFC-62172348, and Shenzhen Science and Technology Program No. JCYJ20220530143604010.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiaoguang Han.

Editor information

Editors and Affiliations

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 27511 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yan, Z. et al. (2025). DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15070. Springer, Cham. https://doi.org/10.1007/978-3-031-73254-6_8


  • DOI: https://doi.org/10.1007/978-3-031-73254-6_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73253-9

  • Online ISBN: 978-3-031-73254-6

  • eBook Packages: Computer Science, Computer Science (R0)
