Structured-NeRF: Hierarchical Scene Graph with Neural Representation

  • Conference paper
  • First Online:

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15093)

Abstract

We present Structured Neural Radiance Field (Structured-NeRF), an indoor scene representation built on a novel hierarchical scene graph that organizes the neural radiance field. Existing object-centric methods focus only on the inherent characteristics of objects while overlooking the semantic and physical relationships between them. Our scene graph manages the complex real-world correlations among objects within a scene, enabling functionality beyond novel view synthesis, such as scene re-arrangement. Based on this hierarchical structure, we introduce an optimization strategy grounded in semantic and physical relationships, which simplifies the operations involved in scene editing and ensures both efficiency and accuracy. Moreover, we render shadows on objects to further increase the realism of the rendered images. Experimental results demonstrate that our structured representation not only achieves state-of-the-art (SOTA) performance in object-level and scene-level rendering, but also advances downstream applications in combination with LLMs/VLMs, such as automatic and instruction/image-conditioned scene re-arrangement, thereby extending NeRF to convenient and controllable interactive editing.

Z. Zhong and J. Cao—Equal contribution.
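
To make the hierarchical scene graph concrete, the following minimal Python sketch shows how such a structure could organize per-object radiance fields under parent-relative transforms. Everything here (SceneNode, rearrange, the "support" relation, the placeholder nerf field) is an illustrative assumption, not the paper's implementation.

    from __future__ import annotations

    from dataclasses import dataclass, field

    import numpy as np


    @dataclass
    class SceneNode:
        """One node of a hierarchical scene graph (illustrative sketch).

        Each object node owns a per-object radiance field (a placeholder
        here) and a rigid transform relative to its parent, so editing a
        parent (a table) implicitly re-poses supported children (a cup).
        """
        name: str
        local_T: np.ndarray                   # 4x4 pose relative to parent
        nerf: object = None                   # per-object NeRF (placeholder)
        relation: str = "support"             # semantic/physical link to parent
        children: list[SceneNode] = field(default_factory=list)

        def world_T(self, parent_world_T: np.ndarray = None) -> np.ndarray:
            """Compose rigid transforms from the root down to this node."""
            base = np.eye(4) if parent_world_T is None else parent_world_T
            return base @ self.local_T

        def rearrange(self, new_local_T: np.ndarray) -> None:
            """Edit this node's pose; children follow via the hierarchy."""
            self.local_T = new_local_T


    # Example: a cup supported by a table.
    table = SceneNode("table", local_T=np.eye(4))
    cup_T = np.eye(4)
    cup_T[2, 3] = 0.8                         # cup sits 0.8 m above the table
    cup = SceneNode("cup", local_T=cup_T)
    table.children.append(cup)

    # Re-arranging the table (0.5 m along x) moves the cup with it.
    shift = np.eye(4)
    shift[0, 3] = 0.5
    table.rearrange(shift @ table.local_T)
    print(cup.world_T(table.world_T()))       # cup's world pose follows

The design point this sketch illustrates is the one the abstract claims for the scene graph: an edit touches a single node, and semantic/physical relationships such as support propagate the change to dependent objects automatically.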

Author information

Corresponding author

Correspondence to Zike Yan.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhong, Z. et al. (2025). Structured-NeRF: Hierarchical Scene Graph with Neural Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15093. Springer, Cham. https://doi.org/10.1007/978-3-031-72761-0_11

  • DOI: https://doi.org/10.1007/978-3-031-72761-0_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72760-3

  • Online ISBN: 978-3-031-72761-0

  • eBook Packages: Computer Science, Computer Science (R0)
