Abstract
Sparse RGBD scene completion is a challenging task, especially when consistent textures and geometries must be maintained throughout the entire scene. Unlike existing solutions that rely on human-designed text prompts or predefined camera trajectories, we propose GenRC, an automated, training-free pipeline that completes a room-scale 3D mesh with high-fidelity textures. To achieve this, we first project the sparse RGBD images to a highly incomplete 3D mesh. Instead of iteratively generating novel views to fill in the void, we use our proposed E-Diffusion to generate a view-consistent panoramic RGBD image, which ensures global geometry and appearance consistency. Furthermore, we maintain input-output scene stylistic consistency through textual inversion, replacing human-designed text prompts. To bridge the domain gap among datasets, E-Diffusion leverages models trained on large-scale datasets to generate diverse appearances. GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on the ScanNet and ARKitScenes datasets, even though GenRC is not trained on these datasets and does not rely on predefined camera trajectories. Project page: https://minfenli.github.io/GenRC/
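To make the first step of the pipeline concrete (lifting the sparse RGBD inputs into partial 3D scene geometry), the sketch below back-projects a single RGBD frame into a colored, world-space point cloud under standard pinhole-camera assumptions. This is a minimal NumPy illustration only; the function and variable names are hypothetical and the authors' implementation builds a textured mesh rather than a raw point cloud.

```python
import numpy as np

def backproject_rgbd(rgb, depth, K, cam_to_world):
    """Lift one RGBD frame to a colored 3D point cloud in world coordinates.

    rgb:          (H, W, 3) uint8 image
    depth:        (H, W) depth map in meters (0 where invalid)
    K:            (3, 3) pinhole intrinsic matrix
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0

    # Pixel coordinates -> 3D points in the camera frame via the pinhole model.
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # (N, 4) homogeneous

    # Transform into world coordinates and attach per-point colors.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    colors = rgb[valid].astype(np.float32) / 255.0
    return pts_world, colors

# Fusing several such sparse frames yields the highly incomplete scene geometry
# that the later stages (panoramic RGBD generation and inpainting) complete.
```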
Acknowledgements
This project is supported by the National Science and Technology Council (NSTC) and Taiwan Computing Cloud (TWCC) under projects NSTC 112-2634-F-002-006 and 113-2221-E-007-104.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, MF. et al. (2025). GenRC: Generative 3D Room Completion from Sparse Image Collections. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15095. Springer, Cham. https://doi.org/10.1007/978-3-031-72913-3_9
DOI: https://doi.org/10.1007/978-3-031-72913-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72912-6
Online ISBN: 978-3-031-72913-3
eBook Packages: Computer Science, Computer Science (R0)