Abstract
We introduce the task of localizing an input image within a multi-modal reference map represented by a collection of 3D scene graphs. These scene graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, and offer a lightweight and efficient alternative to conventional methods that rely on extensive image databases. Given these modalities, the proposed method, SceneGraphLoc, learns a fixed-size embedding for each node (i.e., each object instance) in the scene graph, enabling effective matching with the objects visible in the input query image. This strategy significantly outperforms other cross-modal methods, even without incorporating images into the map representation. With images, SceneGraphLoc achieves performance close to that of state-of-the-art techniques that depend on large image databases, while requiring three orders of magnitude less storage and running orders of magnitude faster. Code and models are available at https://scenegraphloc.github.io.
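The abstract describes coarse localization as matching the objects seen in a query image against per-node embeddings and then selecting the best-matching scene graph. The sketch below illustrates that retrieval step under simplifying assumptions of our own: each node embedding is formed by averaging whichever modality features happen to be available and normalizing the result (a hypothetical stand-in for the learned multi-modal encoder), and each scene is scored by the mean, over query object embeddings, of the best cosine similarity to any node in that scene. The function names and the scoring rule are illustrative only and are not SceneGraphLoc's actual architecture or scoring function.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_node_embedding(modality_feats: Sequence[Optional[Sequence[float]]]) -> Vector:
    """Hypothetical fusion: average the available per-modality vectors
    (e.g. point cloud, image, attributes, relationships) and normalize.
    A missing modality, such as an absent image, is passed as None and skipped."""
    present = [f for f in modality_feats if f is not None]
    if not present:
        raise ValueError("a node needs at least one modality")
    dim = len(present[0])
    mean = [sum(f[i] for f in present) / len(present) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def score_scene(query_embs: Sequence[Sequence[float]],
                node_embs: Sequence[Sequence[float]]) -> float:
    """Assumed scoring rule: for each query object embedding, take its best
    match among the scene's nodes, then average those best similarities."""
    best = [max(cosine(q, n) for n in node_embs) for q in query_embs]
    return sum(best) / len(best)


def rank_scenes(query_embs: Sequence[Sequence[float]],
                scene_graphs: Dict[str, Sequence[Sequence[float]]]) -> List[str]:
    """Return scene ids ordered from most to least likely to contain the query."""
    scores = {sid: score_scene(query_embs, nodes) for sid, nodes in scene_graphs.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    e = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # one-hot toy features
    scene_a = [fuse_node_embedding([e[0], e[0]]),  # point cloud and image both available
               fuse_node_embedding([e[1], None])]  # no image for this object
    scene_b = [fuse_node_embedding([e[2], None]),
               fuse_node_embedding([e[3], None])]
    query = [e[0], e[1]]  # embeddings of the objects visible in the query image
    print(rank_scenes(query, {"A": scene_a, "B": scene_b}))  # ['A', 'B']
```

In this toy run, the query's two object embeddings coincide with the nodes of scene "A", so that scene scores 1.0 and scene "B" scores 0.0, even though one node of "A" has no image feature. Skipping missing modalities in `fuse_node_embedding` mirrors, in a very simplified way, the abstract's point that the map remains usable with or without images.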
Acknowledgements
We would like to thank our colleagues Ganlin Zhang, Sayan Deb Sarkar, and Cathrin Elich for their valuable advice and insightful discussions throughout the course of this research. Their contributions and suggestions greatly enhanced the quality and depth of this work. This work was partially funded by the Design++ initiative of ETH Zurich, the ETH RobotX research grant, the Hasler Stiftung Research Grant via the ETH Zurich Foundation, an ETH AI Center postdoctoral research fellowship, and an ETH Zurich Career Seed Award.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miao, Y., Engelmann, F., Vysotska, O., Tombari, F., Pollefeys, M., Baráth, D.B. (2025). SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_8
DOI: https://doi.org/10.1007/978-3-031-73242-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)