SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We introduce the task of localizing an input image within a multi-modal reference map represented by a collection of 3D scene graphs. These scene graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given these modalities, the proposed method SceneGraphLoc learns a fixed-sized embedding for each node (i.e., representing object instances) in the scene graph, enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map representation. With images, SceneGraphLoc achieves performance close to that of state-of-the-art techniques depending on large image databases, while requiring three orders-of-magnitude less storage and operating orders-of-magnitude faster. Code and models are available at https://scenegraphloc.github.io.
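The matching idea described above (fixed-size embeddings per scene-graph node, compared against embeddings of objects visible in the query image) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy 3-dimensional embeddings, and the scoring rule (mean of each query object's best cosine similarity to any node) are hypothetical and are not taken from the paper, which may score scenes differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scene_score(query_embs, node_embs):
    """Hypothetical scoring rule: average, over query objects, of the best
    similarity to any node embedding in the scene graph."""
    if not query_embs or not node_embs:
        return 0.0
    best = [max(cosine(q, n) for n in node_embs) for q in query_embs]
    return sum(best) / len(best)


def localize(query_embs, scenes):
    """Return the name of the scene graph whose nodes best match the query."""
    return max(scenes, key=lambda name: scene_score(query_embs, scenes[name]))


# Toy data (assumed): each scene graph is a list of node embeddings.
scenes = {
    "kitchen": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "office": [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
# Embeddings of two objects visible in the query image (assumed).
query = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0]]

best_scene = localize(query, scenes)
print(best_scene)
```

Because each scene is summarized by a handful of fixed-size vectors rather than a database of images, the comparison cost grows with the number of object nodes, which is consistent with the storage and speed motivation stated in the abstract.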

Acknowledgements

We would like to thank our colleagues Ganlin Zhang, Sayan Deb Sarkar, and Cathrin Elich for their valuable advice and insightful discussions throughout the course of this research. Their contributions and suggestions greatly enhanced the quality and depth of this work. This work was partially funded by the Design++ initiative of ETH Zurich, the ETH RobotX research grant, the Hasler Stiftung Research Grant via the ETH Zurich Foundation, an ETH AI Center postdoctoral research fellowship, and an ETH Zurich Career Seed Award.

Author information

Corresponding author

Correspondence to Dániel Béla Baráth.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 10072 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Miao, Y., Engelmann, F., Vysotska, O., Tombari, F., Pollefeys, M., Baráth, D.B. (2025). SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_8

  • DOI: https://doi.org/10.1007/978-3-031-73242-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73241-6

  • Online ISBN: 978-3-031-73242-3

  • eBook Packages: Computer Science, Computer Science (R0)
