Abstract
We present Structured Neural Radiance Field (Structured-NeRF), an indoor scene representation built on a novel hierarchical scene graph that organizes the neural radiance field. Existing object-centric methods focus only on the inherent characteristics of objects while overlooking the semantic and physical relationships between them. Our scene graph manages the complex real-world correlations among objects within a scene, enabling functionality beyond novel view synthesis, such as scene rearrangement. Building on this hierarchical structure, we introduce an optimization strategy based on semantic and physical relationships, which simplifies scene-editing operations while ensuring both efficiency and accuracy. Moreover, we render shadows on objects to further enhance the realism of the rendered images. Experimental results demonstrate that our structured representation not only achieves state-of-the-art (SOTA) performance in object-level and scene-level rendering, but also enables downstream applications in combination with LLMs/VLMs, such as automatic and instruction/image-conditioned scene rearrangement, thereby extending NeRF to convenient and controllable interactive editing.
Z. Zhong and J. Cao—Equal contribution.
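To make the hierarchy described in the abstract concrete, below is a minimal, illustrative Python sketch of a scene graph whose nodes each carry a per-object radiance field and a pose relative to their parent, with edges labelled by a semantic/physical relation. All names here (SceneNode, field_fn, move, the "supported-by" label) are hypothetical stand-ins rather than the paper's actual implementation; the sketch only shows why storing object poses relative to a parent turns scene rearrangement into a local edit that cascades to supported children.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneNode:
    """One node of a hierarchical scene graph (illustrative only).

    `field_fn` stands in for a per-object radiance field, e.g. an MLP
    mapping (position, view direction) -> (rgb, density).
    """
    name: str
    transform: np.ndarray              # 4x4 object-to-parent pose
    field_fn: object = None            # per-object NeRF (placeholder)
    relation: str = "supported-by"     # semantic/physical edge to parent
    children: list = field(default_factory=list)

    def world_transform(self, parent_world=np.eye(4)):
        # A child's world pose is composed from its parent's pose.
        return parent_world @ self.transform

    def move(self, delta):
        # Rearrangement is a local edit: children store poses relative
        # to this node, so supported objects follow automatically.
        self.transform = delta @ self.transform

# Example: a cup supported by a table; moving the table carries the cup.
table = SceneNode("table", np.eye(4))
cup = SceneNode("cup", np.eye(4), relation="supported-by")
table.children.append(cup)

shift = np.eye(4)
shift[:3, 3] = [0.5, 0.0, 0.0]   # translate the table by 0.5 m in x
table.move(shift)

cup_world = cup.world_transform(table.world_transform())
assert np.allclose(cup_world[:3, 3], [0.5, 0.0, 0.0])
```

Under this (assumed) layout, an LLM/VLM-driven rearrangement only needs to emit pose edits for a few parent nodes, while the per-object radiance fields themselves remain untouched.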
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhong, Z. et al. (2025). Structured-NeRF: Hierarchical Scene Graph with Neural Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15093. Springer, Cham. https://doi.org/10.1007/978-3-031-72761-0_11
DOI: https://doi.org/10.1007/978-3-031-72761-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72760-3
Online ISBN: 978-3-031-72761-0
eBook Packages: Computer Science (R0)