AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at house scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process then refines the geometry, and an egocentric inpainting process adds lifelike textures. AnyHome stands out for its editability, customizability, diversity, and realism. Its structured scene representations allow extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures.

R. Fu, Z. Wen, and Z. Liu contributed equally.
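The abstract compresses a four-stage pipeline: templated LLM prompting yields a structured scene representation; that representation constrains the synthesis of a house-scale geometry mesh; Score Distillation Sampling (SDS) refines the geometry; and egocentric inpainting adds textures. For orientation, SDS in its standard DreamFusion formulation (Poole et al., 2022) optimizes scene parameters θ by pushing rendered views toward a text-conditioned diffusion prior:

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
\]

where x = g(θ) is a rendered view, x_t its noised version at diffusion timestep t, y the text prompt, ε ~ N(0, I), and ε̂_φ the diffusion model's noise prediction. The paper may use a variant of this objective; treat the formula above as the generic form.

To make the intermediate "amodal structured representation" concrete, the sketch below shows one plausible shape for it in Python. The schema (class names, fields, and the constraint vocabulary) is our own assumption for exposition, not the authors' actual data structure:

```python
# Hypothetical sketch of a structured scene representation in the spirit of
# the abstract: rooms, their contents, and placement constraints that a
# downstream layout/geometry stage could enforce. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ObjectSpec:
    category: str      # coarse class, e.g. "sofa"
    description: str   # free-form text, reusable by the texturing stage
    placement: str     # constraint label, e.g. "against_wall", "corner"

@dataclass
class RoomSpec:
    name: str                       # e.g. "living room"
    footprint: tuple                # (width, depth) in metres
    objects: list = field(default_factory=list)

@dataclass
class HouseSpec:
    rooms: list                     # RoomSpec instances
    adjacency: list                 # pairs of room names sharing a door

# What the LLM stage might emit for "a cozy two-room cottage":
house = HouseSpec(
    rooms=[
        RoomSpec("living room", (5.0, 4.0),
                 [ObjectSpec("sofa", "a worn leather sofa", "against_wall")]),
        RoomSpec("bedroom", (4.0, 3.5),
                 [ObjectSpec("bed", "a quilt-covered double bed", "corner")]),
    ],
    adjacency=[("living room", "bedroom")],
)

print(f"{len(house.rooms)} rooms; first object:",
      house.rooms[0].objects[0].description)
```

Editing at varying levels of granularity, as the abstract claims, then amounts to mutating this structure (swapping an object, resizing a room) and re-running the downstream synthesis stages.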

Acknowledgements

This research was supported by AFOSR grant FA9550-21-1-0214. The authors thank Dylan Hu, Selena Ling, Kai Wang, and Daniel Ritchie.

Author information

Correspondence to Rao Fu.

Electronic Supplementary Material

Below are the links to the electronic supplementary material.

Supplementary material 1 (MP4, 103,814 KB)

Supplementary material 2 (PDF, 4,501 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fu, R., Wen, Z., Liu, Z., Sridhar, S. (2025). AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_4

  • DOI: https://doi.org/10.1007/978-3-031-72933-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72932-4

  • Online ISBN: 978-3-031-72933-1

  • eBook Packages: Computer Science (R0)
