Abstract
Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models, achieved by distilling feature maps generated by image-based foundation models into feature maps rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation that integrates the strengths of GS and multi-resolution hash encodings (MHE). Our training procedure also introduces a pixel-alignment loss that pulls the rendered features of the same semantic entity close together, following pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection, despite being \(851\times\) faster at inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments. We plan to release the code on the project page.
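To make the training objective concrete, below is a minimal PyTorch-style sketch of the two losses summarized above: a per-pixel feature distillation term and a pixel-alignment term. The tensor shapes, the cosine-distance forms, and the use of a precomputed segment mask are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Illustrative sketch (not the authors' code): distilling 2D foundation-model
# feature maps into feature maps rendered from a 3D scene representation.
import torch
import torch.nn.functional as F


def distillation_loss(rendered_feat, teacher_feat):
    """Pull rendered per-pixel features toward 2D foundation-model features.

    rendered_feat, teacher_feat: (H, W, D) tensors for one training view.
    """
    r = F.normalize(rendered_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    # Cosine-distance distillation, averaged over all pixels.
    return (1.0 - (r * t).sum(dim=-1)).mean()


def pixel_alignment_loss(rendered_feat, segment_ids):
    """Encourage pixels of the same semantic entity to share similar features.

    segment_ids: (H, W) integer mask (e.g., from a 2D segmenter), used here
    only as an assumed per-pixel grouping signal.
    """
    r = F.normalize(rendered_feat, dim=-1).reshape(-1, rendered_feat.shape[-1])
    ids = segment_ids.reshape(-1)
    loss = 0.0
    for sid in ids.unique():
        members = r[ids == sid]
        center = members.mean(dim=0, keepdim=True)
        # Distance of each member feature to its segment centroid.
        loss = loss + (1.0 - F.cosine_similarity(members, center)).mean()
    return loss / ids.unique().numel()


# Example usage with random tensors standing in for a rendered feature map
# and a CLIP/DINO feature map resized to the same resolution.
H, W, D = 64, 64, 512
rendered = torch.randn(H, W, D, requires_grad=True)
teacher = torch.randn(H, W, D)
segments = torch.randint(0, 8, (H, W))
total = distillation_loss(rendered, teacher) + 0.1 * pixel_alignment_loss(rendered, segments)
total.backward()
```

The relative weight between the two terms (0.1 here) is an arbitrary placeholder; in practice it would be tuned per scene and per feature backbone.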









Data Availability
Figures 3, 5, 6, 7, 8 and Tables 1, 3 report experimental results on the publicly available data shared by Kerr et al. (2023). Figure 9 and Table 2 report experimental results on the publicly available data shared by Liu et al. (2023). All intermediate experimental results and data shown in this paper are available from the first author upon request [https://xingxingzuo.github.io/fmgs/].
References
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Amir, S., Gandelsman, Y., Bagon, S., & Dekel, T. (2021). Deep ViT features as dense visual descriptors. arXiv preprint arXiv:2112.05814.
Azuma, D., Miyanishi, T., Kurita, S., & Kawanabe, M. (2022). Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 19129–19139).
Barron, J. T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., & Srinivasan, P. P. (2021). Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5855–5864).
Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P., & Hedman, P. (2022). Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5470–5479).
Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P., Hedman, P. (2023). Zip-nerf: Anti-aliased grid-based neural radiance fields. In ICCV.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650–9660).
Cascante-Bonilla, P., Hui, W., Wang, L., Feris, R. S., & Ordonez, V. (2022). Simvqa: Exploring simulated environments for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5056–5066).
Cen, J., Zhou, Z., Fang, J., Yang, C., Shen, W., Xie, L., Zhang, X., & Tian, Q. (2023). Segment anything in 3D with nerfs. In NeurIPS.
Chen, A., Xu, Z., Geiger, A., Yu, J., & Su, H. (2022). Tensorf: Tensorial radiance fields. In European conference on computer vision (pp. 333–350). Springer.
Chen, G., & Wang, W. (2024). A survey on 3D Gaussian splatting. arXiv preprint arXiv:2401.03890.
Chen, H., Blomqvist, K., Milano, F., & Siegwart, R. (2023). Panoptic vision-language feature fields. arXiv preprint arXiv:2309.05448.
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. (2022). Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143.
Corona, R., Zhu, S., Klein, D., & Darrell, T. (2022). Voxel-informed language grounding. arXiv preprint arXiv:2205.09710.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). Scannet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).
Fei, B., Xu, J., Zhang, R., Zhou, Q., Yang, W., & He, Y. (2024). 3D Gaussian as a new vision era: A survey. arXiv preprint arXiv:2402.07181.
Ghiasi, G., Gu, X., Cui, Y., & Lin, T.-Y. (2021). Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143.
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., & Farhadi, A. (2018). Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4089–4098).
Grinvald, M., Furrer, F., Novkovic, T., Chung, J. J., Cadena, C., Siegwart, R., & Nieto, J. (2019). Volumetric instance-aware semantic mapping and 3d object discovery. IEEE Robotics and Automation Letters, 4(3), 3037–3044.
Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K. M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., Gan, C., Miguel de Melo, C., Tenenbaum, J. B., Torralba, A., Shkurti, F., & Paull, L. (2023). Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. arXiv.
Gu, X., Lin, T.-Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., & Davison, A., et al. (2011). Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology (pp. 559–568).
Jatavallabhula, K. M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Li, S., Iyer, G., Saryazdi, S., Keetha, N., Tewari, A., Tenenbaum, J. B., de Melo, C. M., Krishna, M., Paull, L., Shkurti, F., & Torralba, A. (2023). Conceptfusion: Open-set multimodal 3d mapping.
Karkus, P., Cai, S., & Hsu, D. (2021). Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2815–2825).
Keetha, N., Karhade, J., Jatavallabhula, K. M., Yang, G., Scherer, S., Ramanan, D., & Luiten, J. (2023). Splatam: Splat, track & map 3d gaussians for dense rgb-d slam. arXiv preprint arXiv:2312.02126.
Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4), 1–14.
Kerr, J., Kim, C. M., Goldberg, K., Kanazawa, A., & Tancik, M. (2023). Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 19729–19739).
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., & Lo, W.-Y., et al. (2023). Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4015–4026).
Kobayashi, S., Matsumoto, E., & Sitzmann, V. (2022). Decomposing nerf for editing via feature field distillation. In Advances in neural information processing systems volume 35.
Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022). Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546.
Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., & Marculescu, D. (2022). Open-vocabulary semantic segmentation with mask-adapted clip. arXiv preprint arXiv:2210.04150.
Lin, K., Wang, L., & Liu, Z. (2021). End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1954–1963).
Liu, K., Zhan, F., Zhang, J., Xu, M., Yu, Y., El Saddik, A., Theobalt, C., Xing, E., & Lu, S. (2023). Weakly supervised 3d open-vocabulary segmentation. In Thirty-seventh conference on neural information processing systems.
Liu, K., Zhan, F., Zhang, J., Xu, M., Yu, Y., Saddik, A. E., Theobalt, C., Xing, E., & Lu, S. (2023). 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093.
Lu, S., Chang, H., Jing, E. P., Boularias, A., & Bekris, K. (2023). OVIR-3d: Open-vocabulary 3d instance retrieval without training on 3d data. In 7th annual conference on robot learning.
Lu, Y., Xu, C., Wei, X., Xie, X., Tomizuka, M., Keutzer, K., & Zhang, S. (2023). Open-vocabulary point-cloud object detection without 3d annotation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Lüddecke, T., & Ecker, A. (2022). Image segmentation using text and image prompts. In CVPR (pp. 7086–7096).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., & Ramamoorthi, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., & Shen, Z., et al. (2022). Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230.
Müller, T., Evans, A., Schied, C., & Keller, A. (2022). Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4), 1–15.
Narita, G., Seno, T., Ishikawa, T., & Kaji, Y. (2019). Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4205–4212).
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).
Qin, M., Li, W., Zhou, J., Wang, H., & Pfister, H. (2024). Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Schonberger, J. L., & Frahm, J.-M. (2016). Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4104–4113).
Schöps, T., Sattler, T., & Pollefeys, M. (2019). Surfelmeshing: Online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2494–2507.
Shafiullah, N. M. M., Paxton, C., Pinto, L., Chintala, S., & Szlam, A. (2022). Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663.
Shi, J.-C., Wang, M., Duan, H.-B., & Guan, S.-H. (2024). LEGaussians: Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Sun, J., Xie, Y., Chen, L., Zhou, X., & Bao, H. (2021). Neuralrecon: Real-time coherent 3d reconstruction from monocular video. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15598–15607).
Takmaz, A., Fedele, E., Sumner, R. W., Pollefeys, M., Tombari, F., & Engelmann, F. (2023). OpenMask3D: Open-vocabulary 3D instance segmentation. In Advances in neural information processing systems (NeurIPS).
Thomason, J., Shridhar, M., Bisk, Y., Paxton, C., & Zettlemoyer, L. (2022). Language grounding with 3d objects. In Conference on robot learning (pp. 1691–1701).
Tsagkas, N., Mac Aodha, O., & Lu, C. X. (2023). Vl-fields: Towards language-grounded neural implicit spatial representations. arXiv preprint arXiv:2305.12427.
Tschernezki, V., Laina, I., Larlus, D., & Vedaldi, A. (2022). Neural feature fusion fields: 3D distillation of self-supervised 2D image representations. In 3DV.
Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., & Liu, T. (2022). Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11686–11695).
Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., & Neumann, U. (2022). Point-nerf: Point-based neural radiance fields. arXiv preprint arXiv:2201.08845.
Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J. C., & Savarese, S. (2022). Ulip: Learning unified representation of language, image and point cloud for 3d understanding. arXiv preprint arXiv:2212.05171.
Xue, L., Yu, N., Zhang, S., Li, J., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J. C., & Savarese, S. (2023). Ulip-2: Towards scalable multimodal pre-training for 3d understanding.
Yang, X., Zhou, L., Jiang, H., Tang, Z., Wang, Y., Bao, H., & Zhang, G. (2020). Mobile3drecon: Real-time monocular 3d reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, 26(12), 3446–3456.
Ye, J., Wang, N., & Wang, X. (2023). Featurenerf: Learning generalizable nerfs by distilling pre-trained vision foundation models. arXiv preprint arXiv:2303.12786.
Yifan, W., Serena, F., Shihao, W., Öztireli, C., & Sorkine-Hornung, O. (2019). Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG), 38(6), 1–14.
Ze, Y., Yan, G., Wu, Y.-H., Macaluso, A., Ge, Y., Ye, J., & Hansen, N., et al. (2023). GNFactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on robot learning (CoRL).
Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B., Qiao, Y., Gao, P., & Li, H. (2021). Pointclip: Point cloud understanding by clip. arXiv preprint arXiv:2112.02413.
Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., & Kadambi, A. (2024). Feature 3DGS: Supercharging 3d Gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., & Misra, I. (2022). Detecting twenty-thousand classes using image-level supervision. In ECCV (pp. 350–368). Springer.
Zwicker, M., Pfister, H., Van Baar, J., & Gross, M. (2001). Ewa volume splatting. In Proceedings Visualization (VIS '01).
Acknowledgements
We are very grateful to Juan J. Gómez Rodríguez and Francis Engelmann for their advice and insightful discussions about this work.
Additional information
Communicated by Hong Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zuo, X., Samangouei, P., Zhou, Y. et al. FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding. Int J Comput Vis 133, 611–627 (2025). https://doi.org/10.1007/s11263-024-02183-8