Abstract
We present Structured Neural Radiance Field (Structured-NeRF), an indoor scene representation built on a novel hierarchical scene graph that organizes the neural radiance field. Existing object-centric methods focus only on the inherent characteristics of objects while overlooking the semantic and physical relationships between them. Our scene graph manages the complex real-world correlations among objects within a scene, enabling functionality beyond novel view synthesis, such as scene rearrangement. Building on this hierarchical structure, we introduce an optimization strategy based on semantic and physical relationships, which simplifies scene-editing operations while ensuring both efficiency and accuracy. Moreover, we render shadows on objects to further enhance the realism of the rendered images. Experimental results demonstrate that our structured representation not only achieves state-of-the-art (SOTA) performance in object-level and scene-level rendering, but also enables downstream applications in combination with LLMs/VLMs, such as automatic and instruction/image-conditioned scene rearrangement, thereby extending NeRF to convenient and controllable interactive editing.
Z. Zhong and J. Cao—Equal contribution.
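To make the hierarchy described in the abstract concrete, below is a minimal, illustrative Python sketch of a scene graph whose nodes each carry a per-object radiance field and a pose relative to their parent, with edges labelled by a semantic/physical relation. All names here (SceneNode, field_fn, move, the "supported-by" label) are hypothetical stand-ins rather than the paper's actual implementation; the sketch only shows why storing object poses relative to a parent turns scene rearrangement into a local edit that cascades to supported children.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneNode:
    """One node of a hierarchical scene graph (illustrative only).

    `field_fn` stands in for a per-object radiance field, e.g. an MLP
    mapping (position, view direction) -> (rgb, density).
    """
    name: str
    transform: np.ndarray              # 4x4 object-to-parent pose
    field_fn: object = None            # per-object NeRF (placeholder)
    relation: str = "supported-by"     # semantic/physical edge to parent
    children: list = field(default_factory=list)

    def world_transform(self, parent_world=np.eye(4)):
        # A child's world pose is composed from its parent's pose.
        return parent_world @ self.transform

    def move(self, delta):
        # Rearrangement is a local edit: children store poses relative
        # to this node, so supported objects follow automatically.
        self.transform = delta @ self.transform

# Example: a cup supported by a table; moving the table carries the cup.
table = SceneNode("table", np.eye(4))
cup = SceneNode("cup", np.eye(4), relation="supported-by")
table.children.append(cup)

shift = np.eye(4)
shift[:3, 3] = [0.5, 0.0, 0.0]   # translate the table by 0.5 m in x
table.move(shift)

cup_world = cup.world_transform(table.world_transform())
assert np.allclose(cup_world[:3, 3], [0.5, 0.0, 0.0])
```

Under this (assumed) layout, an LLM/VLM-driven rearrangement only needs to emit pose edits for a few parent nodes, while the per-object radiance fields themselves remain untouched.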
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhong, Z. et al. (2025). Structured-NeRF: Hierarchical Scene Graph with Neural Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15093. Springer, Cham. https://doi.org/10.1007/978-3-031-72761-0_11
DOI: https://doi.org/10.1007/978-3-031-72761-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72760-3
Online ISBN: 978-3-031-72761-0
eBook Packages: Computer Science (R0)