SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

Miao, Yang; Engelmann, Francis; Vysotska, Olga; Tombari, Federico; Pollefeys, Marc; Baráth, Dániel Béla

doi:10.1007/978-3-031-73242-3_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15066)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We introduce the task of localizing an input image within a multi-modal reference map represented by a collection of 3D scene graphs. These scene graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given these modalities, the proposed method SceneGraphLoc learns a fixed-size embedding for each node of the scene graph (i.e., each object instance), enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map representation. With images, SceneGraphLoc achieves performance close to that of state-of-the-art techniques that depend on large image databases, while requiring three orders of magnitude less storage and operating orders of magnitude faster. Code and models are available at https://scenegraphloc.github.io.
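
To make the matching idea concrete, the following is a minimal sketch of coarse localization by embedding comparison. It is not the authors' implementation: the abstract states only that each scene-graph node receives a fixed-size embedding that is matched against objects visible in the query image. The scoring rule used here (cosine similarity, best-matching node per query object, averaged per scene) and all names (`cosine`, `scene_score`, `rank_scenes`, the toy scene identifiers and vectors) are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def scene_score(query_objects: List[Vector], node_embeddings: List[Vector]) -> float:
    """Assumed scoring rule: mean, over query objects, of the best node similarity."""
    if not query_objects or not node_embeddings:
        return float("-inf")
    best = [max(cosine(q, n) for n in node_embeddings) for q in query_objects]
    return sum(best) / len(best)


def rank_scenes(
    query_objects: List[Vector], scene_graphs: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Rank candidate scenes from most to least similar to the query image."""
    scored = [
        (scene_id, scene_score(query_objects, nodes))
        for scene_id, nodes in scene_graphs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: each scene graph is reduced to its node embeddings,
# and the query image is reduced to one embedding per visible object.
scene_graphs = {
    "kitchen": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "office": [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
query_objects = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

ranking = rank_scenes(query_objects, scene_graphs)
best_scene = ranking[0][0]
print(best_scene)
```

In this sketch the map side stores only one small vector per object instance, which is the kind of representation that keeps storage low compared with a database of reference images; how the embeddings are learned from point clouds, images, attributes, and relationships is outside the scope of the example.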

Acknowledgements

We would like to thank our colleagues Ganlin Zhang, Sayan Deb Sarkar, and Cathrin Elich for their valuable advice and insightful discussions throughout the course of this research. Their contributions and suggestions greatly enhanced the quality and depth of this work. This work was partially funded by the Design++ initiative of ETH Zurich, the ETH RobotX research grant, the Hasler Stiftung Research Grant via the ETH Zurich Foundation, an ETH AI Center postdoctoral research fellowship, and an ETH Zurich Career Seed Award.

Author information

Authors and Affiliations

ETH Zurich, Zürich, Switzerland
Yang Miao, Francis Engelmann, Olga Vysotska, Marc Pollefeys & Dániel Béla Baráth
Google, Menlo Park, USA
Francis Engelmann & Federico Tombari
TU Munich, Munich, Germany
Federico Tombari
Microsoft, Redmond, USA
Marc Pollefeys

Corresponding author

Correspondence to Dániel Béla Baráth.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 10072 KB)

About this paper

Cite this paper

Miao, Y., Engelmann, F., Vysotska, O., Tombari, F., Pollefeys, M., Baráth, D.B. (2025). SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_8

DOI: https://doi.org/10.1007/978-3-031-73242-3_8
Published: 29 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)
