Abstract
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they alternate between triangulating sparse 3D points and registering additional camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as the iterated application and refinement of a visual relocalizer, that is, a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Unlike other learning-based reconstruction methods, we require neither pose priors nor sequential inputs, and we optimize efficiently over thousands of images. In many cases, our method, ACE0, estimates camera poses with an accuracy close to feature-based SfM, as demonstrated by novel view synthesis.
Project page: https://nianticlabs.github.io/acezero/.
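The abstract describes an incremental loop: seed the reconstruction, train a scene coordinate regressor on the currently posed views, use it as a relocalizer to register further views, and repeat. The following is a minimal, hypothetical Python sketch of that control flow only; the helper functions (train_regressor, relocalize), the confidence threshold, and the placeholder pose representation are illustrative assumptions, not the authors' implementation.

```python
"""Hypothetical sketch of an incremental reconstruction loop driven by a
relocalizer, as outlined in the abstract. All helpers are placeholders."""
from dataclasses import dataclass
import random


@dataclass
class Pose:
    # 6-DoF camera pose; a real system would store a rotation and translation.
    params: tuple


def train_regressor(posed):
    """Placeholder for training a scene coordinate regression network
    on the currently posed images (the 'refinement' step)."""
    return {"trained_on": list(posed)}


def relocalize(model, image):
    """Placeholder for relocalization: predict scene coordinates for the
    image and solve for the pose. Returns (pose, confidence)."""
    return Pose(params=(0, 0, 0, 0, 0, 0)), random.random()


def reconstruct(images, conf_thresh=0.5, max_rounds=10):
    # Seed: pose one image arbitrarily; all other images start unposed.
    posed = {images[0]: Pose(params=(0, 0, 0, 0, 0, 0))}
    unposed = set(images[1:])
    for _ in range(max_rounds):
        model = train_regressor(posed)       # refine the relocalizer
        newly_posed = {}
        for image in list(unposed):
            pose, conf = relocalize(model, image)
            if conf >= conf_thresh:          # accept confident registrations
                newly_posed[image] = pose
        if not newly_posed:                  # no progress: stop iterating
            break
        posed.update(newly_posed)
        unposed -= set(newly_posed)
    return posed


if __name__ == "__main__":
    poses = reconstruct([f"img_{i:04d}.jpg" for i in range(100)])
    print(f"registered {len(poses)} of 100 images")
```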
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Brachmann, E. et al. (2025). Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_24
DOI: https://doi.org/10.1007/978-3-031-72992-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72991-1
Online ISBN: 978-3-031-72992-8
eBook Packages: Computer Science, Computer Science (R0)