Abstract
Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured, textured indoor scenes at house scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts textual narratives into amodal structured representations. These representations ensure consistent and realistic spatial layouts by guiding the synthesis of a geometric mesh within defined constraints. A Score Distillation Sampling process then refines the geometry, and an egocentric inpainting process adds lifelike textures to it. AnyHome stands out for its editability, customizability, diversity, and realism. Its structured scene representations allow extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometry and textures that outperform existing methods in both quantitative and qualitative measures.
R. Fu, Z. Wen and Z. Liu—Equal Contribution.
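To make the four-stage pipeline in the abstract concrete, here is a minimal, hypothetical Python sketch of its data flow: text to structured representation, constrained geometry synthesis, SDS refinement, and egocentric texturing. All names here (SceneGraph, text_to_structure, refine_with_sds, and so on) are illustrative inventions rather than the authors' actual API, and the stubs return placeholder data instead of real LLM outputs or meshes.

```python
"""Hypothetical sketch of the AnyHome pipeline described in the abstract.

Every class and function name is an assumption for illustration only;
this is not the authors' implementation.
"""
from dataclasses import dataclass, field


@dataclass
class Room:
    label: str
    objects: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Amodal structured representation: rooms plus adjacency constraints."""
    rooms: list[Room]
    adjacency: list[tuple[str, str]]


def text_to_structure(narrative: str) -> SceneGraph:
    # Stage 1 (hypothetical): prompt an LLM with designed templates to
    # turn the narrative into a house-scale structured representation.
    # Here we just return a fixed placeholder graph.
    return SceneGraph(
        rooms=[Room("living room", ["sofa", "tv"]), Room("kitchen", ["stove"])],
        adjacency=[("living room", "kitchen")],
    )


def synthesize_geometry(graph: SceneGraph) -> dict:
    # Stage 2 (hypothetical): instantiate a mesh whose layout satisfies
    # the adjacency and placement constraints encoded in the graph.
    return {"mesh": f"{len(graph.rooms)}-room layout", "refined": False}


def refine_with_sds(mesh: dict) -> dict:
    # Stage 3 (hypothetical): refine the coarse geometry via Score
    # Distillation Sampling against a 2D diffusion prior.
    mesh["refined"] = True
    return mesh


def texture_egocentric(mesh: dict) -> dict:
    # Stage 4 (hypothetical): inpaint lifelike textures from egocentric
    # viewpoints traversing the scene.
    mesh["textured"] = True
    return mesh


if __name__ == "__main__":
    scene = text_to_structure("a cozy two-room cottage with a sunlit kitchen")
    print(texture_egocentric(refine_with_sds(synthesize_geometry(scene))))
```

Keeping the intermediate SceneGraph explicit, as the paper's structured representation does, is what enables editing at varying granularity: a user can change a room label or an adjacency edge and rerun only the downstream stages.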
Acknowledgements
This research was supported by AFOSR grant FA9550-21-1-0214. The authors thank Dylan Hu, Selena Ling, Kai Wang and Daniel Ritchie.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fu, R., Wen, Z., Liu, Z., Sridhar, S. (2025). AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_4
DOI: https://doi.org/10.1007/978-3-031-72933-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1
eBook Packages: Computer Science, Computer Science (R0)