Abstract
Scene coordinate regression has made significant improvements in visual localization by voting with segmentation strategy, which first segments landmark patches and then votes landmark points within each patch. However, such method ignores global correlations between patches and lacks local aggregation of voting directions. In this paper, we present a new visual localization framework based on an efficient vision transformer and voting aggregation loss. Specifically, we introduce a global correlation feature extractor to capture the correlated features of global information and employ differentiable angular loss to enhance the aggregation of local voting directions, thus improving the accuracy and robustness of visual localization. Extensive experiments are conducted on the 7-scenes and Cambridge Landmarks datasets. Results show that our method is superior to the scene coordinate regression method on both datasets, demonstrating the effectiveness of this framework.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
Brachmann, E., et al.: DSAC-differentiable RANSAC for camera localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692 (2017)
Brachmann, E., Rother, C.: Learning less is more-6D camera localization via 3D surface regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662 (2018)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236 (2018)
Dosovitskiy, A., et al.: An image is worth 16Â \(\times \)Â 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Fan, H., Zhou, Y., Li, A., Gao, S., Li, J., Guo, Y.: Visual localization using semantic segmentation and depth prediction. arXiv preprint arXiv:2005.11922 (2020)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Hu, Y., Hugonot, J., Fua, P., Salzmann, M.: Segmentation-driven 6D object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3385–3394 (2019)
Huang, Z., et al.: VS-Net: voting with segmentation for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6101–6111 (2021)
Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015)
Larsson, M., Stenborg, E., Toft, C., Hammarstrand, L., Sattler, T., Kahl, F.: Fine-grained segmentation networks: self-supervised segmentation for improved long-term visual localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 31–41 (2019)
Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11983–11992 (2020)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
Mehta, S., Rastegari, M.: Separable self-attention for mobile vision transformers. arXiv preprint arXiv:2206.02680 (2022)
Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136. IEEE (2011)
Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: pixel-wise voting network for 6DoF pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4561–4570 (2019)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV 2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: robust hierarchical localization at large scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725 (2019)
Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2D-to-3D matching. In: 2011 International Conference on Computer Vision, pp. 667–674. IEEE (2011)
Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision-ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part I 12, vol. 7572, pp. 752–765. Springer, Cham (2012). https://doi.org/10.1007/978-3-642-33718-5_54
Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1744–1756 (2016)
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937 (2013)
Stenborg, E., Toft, C., Hammarstrand, L.: Long-term visual localization using semantically segmented images. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6484–6490. IEEE (2018)
Valentin, J., Nießner, M., Shotton, J., Fitzgibbon, A., Izadi, S., Torr, P.H.: Exploiting uncertainty in regression forests for accurate camera relocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4400–4408 (2015)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Wu, C.: VisualSFM: a visual structure from motion system (2011). http://www.cs.washington.edu/homes/ccwu/vsfm
Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
Xu, Y., et al.: SelfVoxeLO: self-supervised lidar odometry with voxel-based deep neural networks. In: Conference on Robot Learning, pp. 115–125. PMLR (2021)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yu, X., Zhuang, Z., Koniusz, P., Li, H.: 6DoF object pose estimation via differentiable proxy voting loss. arXiv preprint arXiv:2002.03923 (2020)
Zhang, G., Dong, Z., Jia, J., Wong, T.-T., Bao, H.: Efficient non-consecutive feature tracking for structure-from-motion. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 422–435. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15555-0_31
Zhong, Z., et al.: Squeeze-and-attention networks for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13065–13074 (2020)
Zhou, G., Bescos, B., Dymczyk, M., Pfeiffer, M., Neira, J., Siegwart, R.: Dynamic objects segmentation for visual localization in urban environments. arXiv preprint arXiv:1807.02996 (2018)
Zhu, S., et al.: Parallel structure from motion from local increment to global averaging. arXiv preprint arXiv:1702.08601 (2017)
Liu, J., Nie, Q., Liu, Y., Wang, C.: NeRF-Loc: visual localization with conditional neural radiance field. CoRR abs/2304.07979 (2023)
Zhou, Q., Agostinho, S., Osep, A., Leal-Taix e, L.: Is geometry enough for matching in visual localization? In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022–17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part X. LNCS, vol. 13670, pp. 407–425. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20080-9_24
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (No. 62302220), in part by the China Postdoctoral Foundation (No. 2023M731691), and in part by the Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2022ZB268).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xie, D., Lu, J., Song, Z., Chen, X. (2023). Incorporating Global Correlation and Local Aggregation for Efficient Visual Localization. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_6
Download citation
DOI: https://doi.org/10.1007/978-3-031-46308-2_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46307-5
Online ISBN: 978-3-031-46308-2
eBook Packages: Computer ScienceComputer Science (R0)