Abstract
Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models, achieved by distilling feature maps generated by image-based foundation models into feature maps rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation that integrates the strengths of GS and multi-resolution hash encodings (MHE). Our training procedure also introduces a pixel-alignment loss that pulls the rendered features of the same semantic entity close together, following pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection, despite being \(851\times\) faster at inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments. We plan to release the code on the project page.
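To make the training objective concrete, below is a minimal PyTorch-style sketch of the two losses summarized above: a per-pixel feature distillation term and a pixel-alignment term. The tensor shapes, the cosine-distance forms, and the use of a precomputed segment mask are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Illustrative sketch (not the authors' code): distilling 2D foundation-model
# feature maps into feature maps rendered from a 3D scene representation.
import torch
import torch.nn.functional as F


def distillation_loss(rendered_feat, teacher_feat):
    """Pull rendered per-pixel features toward 2D foundation-model features.

    rendered_feat, teacher_feat: (H, W, D) tensors for one training view.
    """
    r = F.normalize(rendered_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    # Cosine-distance distillation, averaged over all pixels.
    return (1.0 - (r * t).sum(dim=-1)).mean()


def pixel_alignment_loss(rendered_feat, segment_ids):
    """Encourage pixels of the same semantic entity to share similar features.

    segment_ids: (H, W) integer mask (e.g., from a 2D segmenter), used here
    only as an assumed per-pixel grouping signal.
    """
    r = F.normalize(rendered_feat, dim=-1).reshape(-1, rendered_feat.shape[-1])
    ids = segment_ids.reshape(-1)
    loss = 0.0
    for sid in ids.unique():
        members = r[ids == sid]
        center = members.mean(dim=0, keepdim=True)
        # Distance of each member feature to its segment centroid.
        loss = loss + (1.0 - F.cosine_similarity(members, center)).mean()
    return loss / ids.unique().numel()


# Example usage with random tensors standing in for a rendered feature map
# and a CLIP/DINO feature map resized to the same resolution.
H, W, D = 64, 64, 512
rendered = torch.randn(H, W, D, requires_grad=True)
teacher = torch.randn(H, W, D)
segments = torch.randint(0, 8, (H, W))
total = distillation_loss(rendered, teacher) + 0.1 * pixel_alignment_loss(rendered, segments)
total.backward()
```

The relative weight between the two terms (0.1 here) is an arbitrary placeholder; in practice it would be tuned per scene and per feature backbone.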









Data Availability
Figures 3, 5, 6, 7, 8 and Tables 1, 3 report experimental results on the publicly available data shared by Kerr et al. (2023). Figure 9 and Table 2 report experimental results on the publicly available data shared by Liu et al. (2023). All intermediate experimental results and data shown in this paper are available from the first author upon request [https://xingxingzuo.github.io/fmgs/].
References
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Amir, S., Gandelsman, Y., Bagon, S., & Dekel, T. (2021). Deep ViT features as dense visual descriptors. arXiv preprint arXiv:2112.05814.
Azuma, D., Miyanishi, T., Kurita, S., & Kawanabe, M. (2022). Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 19129–19139).
Barron, J. T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., & Srinivasan, P. P. (2021). Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5855–5864).
Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P., & Hedman, P. (2022). Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5470–5479).
Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P., Hedman, P. (2023). Zip-nerf: Anti-aliased grid-based neural radiance fields. In ICCV.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650–9660).
Cascante-Bonilla, P., Hui, W., Wang, L., Feris, R. S., & Ordonez, V. (2022). Simvqa: Exploring simulated environments for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5056–5066).
Cen, J., Zhou, Z., Fang, J., Yang, C., Shen, W., Xie, L., Zhang, X., & Tian, Q. (2023). Segment anything in 3D with nerfs. In NeurIPS.
Chen, A., Xu, Z., Geiger, A., Yu, J., & Su, H. (2022). Tensorf: Tensorial radiance fields. In European conference on computer vision (pp. 333–350). Springer.
Chen, G., & Wang, W. (2024). A survey on 3D Gaussian splatting. arXiv preprint arXiv:2401.03890.
Chen, H., Blomqvist, K., Milano, F., & Siegwart, R. (2023). Panoptic vision-language feature fields. arXiv preprint arXiv:2309.05448.
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. (2022). Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143.
Corona, R., Zhu, S., Klein, D., & Darrell, T. (2022). Voxel-informed language grounding. arXiv preprint arXiv:2205.09710.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). Scannet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).
Fei, B., Xu, J., Zhang, R., Zhou, Q., Yang, W., & He, Y. (2024). 3D Gaussian as a new vision era: A survey. arXiv preprint arXiv:2402.07181.
Ghiasi, G., Gu, X., Cui, Y., & Lin, T.-Y. (2021). Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143.
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., & Farhadi, A. (2018). Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4089–4098).
Grinvald, M., Furrer, F., Novkovic, T., Chung, J. J., Cadena, C., Siegwart, R., & Nieto, J. (2019). Volumetric instance-aware semantic mapping and 3d object discovery. IEEE Robotics and Automation Letters, 4(3), 3037–3044.
Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K. M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., Gan, C., Miguel de Melo, C., Tenenbaum, J. B., Torralba, A., Shkurti, F., & Paull, L. (2023). Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. arXiv.
Gu, X., Lin, T.-Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., & Davison, A., et al. (2011). Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology (pp. 559–568).
Jatavallabhula, K. M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Li, S., Iyer, G., Saryazdi, S., Keetha, N., Tewari, A., Tenenbaum, J. B., de Melo, C. M., Krishna, M., Paull, L., Shkurti, F., & Torralba, A. (2023). Conceptfusion: Open-set multimodal 3d mapping.
Karkus, P., Cai, S., & Hsu, D. (2021). Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2815–2825).
Keetha, N., Karhade, J., Jatavallabhula, K. M., Yang, G., Scherer, S., Ramanan, D., & Luiten, J. (2023). Splatam: Splat, track & map 3d gaussians for dense rgb-d slam. arXiv preprint arXiv:2312.02126.
Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4), 1–14.
Kerr, J., Kim, C. M., Goldberg, K., Kanazawa, A., & Tancik, M. (2023). Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 19729–19739).
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., & Lo, W.-Y., et al. (2023). Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4015–4026).
Kobayashi, S., Matsumoto, E., & Sitzmann, V. (2022). Decomposing nerf for editing via feature field distillation. In Advances in neural information processing systems volume 35.
Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022). Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546.
Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., & Marculescu, D. (2022). Open-vocabulary semantic segmentation with mask-adapted clip. arXiv preprint arXiv:2210.04150.
Lin, K., Wang, L., & Liu, Z. (2021). End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1954–1963).
Liu, K., Zhan, F., Zhang, J., Xu, M., Yu, Y., El Saddik, A., Theobalt, C., Xing, E., & Lu, S. (2023). Weakly supervised 3d open-vocabulary segmentation. In Thirty-seventh conference on neural information processing systems.
Liu, K., Zhan, F., Zhang, J., Xu, M., Yu, Y., Saddik, A. E., Theobalt, C., Xing, E., & Lu, S. (2023). 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093.
Lu, S., Chang, H., Jing, E. P., Boularias, A., & Bekris, K. (2023). OVIR-3d: Open-vocabulary 3d instance retrieval without training on 3d data. In 7th annual conference on robot learning.
Lu, Y., Xu, C., Wei, X., Xie, X., Tomizuka, M., Keutzer, K., & Zhang, S. (2023). Open-vocabulary point-cloud object detection without 3d annotation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Lüddecke, T., & Ecker, A. (2022). Image segmentation using text and image prompts. In CVPR (pp. 7086–7096).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., & Ramamoorthi, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., & Shen, Z., et al. (2022). Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230.
Müller, T., Evans, A., Schied, C., & Keller, A. (2022). Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4), 1–15.
Narita, G., Seno, T., Ishikawa, T., & Kaji, Y. (2019). Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4205–4212).
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).
Qin, M., Li, W., Zhou, J., Wang, H., & Pfister, H. (2024). Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Schonberger, J. L., & Frahm, J.-M. (2016). Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4104–4113).
Schöps, T., Sattler, T., & Pollefeys, M. (2019). Surfelmeshing: Online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2494–2507.
Shafiullah, N. M. M., Paxton, C., Pinto, L., Chintala, S., & Szlam, A. (2022). Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663.
Shi, J.-C., Wang, M., Duan, H.-B., & Guan, S.-H. (2024). LEGaussians: Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Sun, J., Xie, Y., Chen, L., Zhou, X., & Bao, H. (2021). Neuralrecon: Real-time coherent 3d reconstruction from monocular video. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15598–15607).
Takmaz, A., Fedele, E., Sumner, R. W., Pollefeys, M., Tombari, F., & Engelmann, F. (2023). OpenMask3D: Open-vocabulary 3D instance segmentation. In Advances in neural information processing systems (NeurIPS).
Thomason, J., Shridhar, M., Bisk, Y., Paxton, C., & Zettlemoyer, L. (2022). Language grounding with 3d objects. In Conference on robot learning (pp. 1691–1701).
Tsagkas, N., Mac Aodha, O., & Lu, C. X. (2023). Vl-fields: Towards language-grounded neural implicit spatial representations. arXiv preprint arXiv:2305.12427.
Tschernezki, V., Laina, I., Larlus, D., & Vedaldi, A. (2022). Neural feature fusion fields: 3D distillation of self-supervised 2D image representations. In 3DV.
Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., & Liu, T. (2022). Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11686–11695).
Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., & Neumann, U. (2022). Point-nerf: Point-based neural radiance fields. arXiv preprint arXiv:2201.08845.
Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J. C., & Savarese, S. (2022). Ulip: Learning unified representation of language, image and point cloud for 3d understanding. arXiv preprint arXiv:2212.05171.
Xue, L., Yu, N., Zhang, S., Li, J., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J. C., & Savarese, S. (2023). Ulip-2: Towards scalable multimodal pre-training for 3d understanding.
Yang, X., Zhou, L., Jiang, H., Tang, Z., Wang, Y., Bao, H., & Zhang, G. (2020). Mobile3drecon: Real-time monocular 3d reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, 26(12), 3446–3456.
Ye, J., Wang, N., & Wang, X. (2023). Featurenerf: Learning generalizable nerfs by distilling pre-trained vision foundation models. arXiv preprint arXiv:2303.12786.
Yifan, W., Serena, F., Shihao, W., Öztireli, C., & Sorkine-Hornung, O. (2019). Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG), 38(6), 1–14.
Ze, Y., Yan, G., Wu, Y.-H., Macaluso, A., Ge, Y., Ye, J., & Hansen, N., et al. (2023). GNFactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on robot learning (CoRL).
Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B., Qiao, Y., Gao, P., & Li, H. (2021). Pointclip: Point cloud understanding by clip. arXiv preprint arXiv:2112.02413.
Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., & Kadambi, A. (2024). Feature 3DGS: Supercharging 3d Gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., & Misra, I. (2022). Detecting twenty-thousand classes using image-level supervision. In ECCV (pp. 350–368). Springer.
Zwicker, M., Pfister, H., Van Baar, J., & Gross, M. (2001). Ewa volume splatting. In Proceedings Visualization (VIS '01).
Acknowledgements
We are very grateful to Juan J. Gómez Rodríguez and Francis Engelmann for their advice and insightful discussions about this work.
Additional information
Communicated by Hong Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zuo, X., Samangouei, P., Zhou, Y. et al. FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding. Int J Comput Vis 133, 611–627 (2025). https://doi.org/10.1007/s11263-024-02183-8