Abstract
Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured, textured indoor scenes at house scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts textual narratives into amodal structured representations. These representations ensure consistent and realistic spatial layouts by guiding the synthesis of a geometric mesh within defined constraints. A Score Distillation Sampling process then refines the geometry, and an egocentric inpainting process adds lifelike textures to it. AnyHome stands out for its editability, customizability, diversity, and realism. Its structured scene representations allow extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometry and textures that outperform existing methods in both quantitative and qualitative measures.
R. Fu, Z. Wen and Z. Liu—Equal Contribution.
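To make the four-stage pipeline in the abstract concrete, here is a minimal, hypothetical Python sketch of its data flow: text to structured representation, constrained geometry synthesis, SDS refinement, and egocentric texturing. All names here (SceneGraph, text_to_structure, refine_with_sds, and so on) are illustrative inventions rather than the authors' actual API, and the stubs return placeholder data instead of real LLM outputs or meshes.

```python
"""Hypothetical sketch of the AnyHome pipeline described in the abstract.

Every class and function name is an assumption for illustration only;
this is not the authors' implementation.
"""
from dataclasses import dataclass, field


@dataclass
class Room:
    label: str
    objects: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Amodal structured representation: rooms plus adjacency constraints."""
    rooms: list[Room]
    adjacency: list[tuple[str, str]]


def text_to_structure(narrative: str) -> SceneGraph:
    # Stage 1 (hypothetical): prompt an LLM with designed templates to
    # turn the narrative into a house-scale structured representation.
    # Here we just return a fixed placeholder graph.
    return SceneGraph(
        rooms=[Room("living room", ["sofa", "tv"]), Room("kitchen", ["stove"])],
        adjacency=[("living room", "kitchen")],
    )


def synthesize_geometry(graph: SceneGraph) -> dict:
    # Stage 2 (hypothetical): instantiate a mesh whose layout satisfies
    # the adjacency and placement constraints encoded in the graph.
    return {"mesh": f"{len(graph.rooms)}-room layout", "refined": False}


def refine_with_sds(mesh: dict) -> dict:
    # Stage 3 (hypothetical): refine the coarse geometry via Score
    # Distillation Sampling against a 2D diffusion prior.
    mesh["refined"] = True
    return mesh


def texture_egocentric(mesh: dict) -> dict:
    # Stage 4 (hypothetical): inpaint lifelike textures from egocentric
    # viewpoints traversing the scene.
    mesh["textured"] = True
    return mesh


if __name__ == "__main__":
    scene = text_to_structure("a cozy two-room cottage with a sunlit kitchen")
    print(texture_egocentric(refine_with_sds(synthesize_geometry(scene))))
```

Keeping the intermediate SceneGraph explicit, as the paper's structured representation does, is what enables editing at varying granularity: a user can change a room label or an adjacency edge and rerun only the downstream stages.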
Acknowledgements
This research was supported by AFOSR grant FA9550-21-1-0214. The authors thank Dylan Hu, Selena Ling, Kai Wang and Daniel Ritchie.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fu, R., Wen, Z., Liu, Z., Sridhar, S. (2025). AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_4
DOI: https://doi.org/10.1007/978-3-031-72933-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1
eBook Packages: Computer Science, Computer Science (R0)