Incorporating Global Correlation and Local Aggregation for Efficient Visual Localization

Xie, Dong; Lu, Jianfeng; Song, Zhenbo; Chen, Xuanzhu

doi:10.1007/978-3-031-46308-2_6

Dong Xie¹⁴,
Jianfeng Lu¹⁴,
Zhenbo Song¹⁴ &
…
Xuanzhu Chen¹⁴

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14356))

Included in the following conference series:

International Conference on Image and Graphics

307 Accesses

Abstract

Scene coordinate regression has made significant improvements in visual localization by voting with segmentation strategy, which first segments landmark patches and then votes landmark points within each patch. However, such method ignores global correlations between patches and lacks local aggregation of voting directions. In this paper, we present a new visual localization framework based on an efficient vision transformer and voting aggregation loss. Specifically, we introduce a global correlation feature extractor to capture the correlated features of global information and employ differentiable angular loss to enhance the aggregation of local voting directions, thus improving the accuracy and robustness of visual localization. Extensive experiments are conducted on the 7-scenes and Cambridge Landmarks datasets. Results show that our method is superior to the scene coordinate regression method on both datasets, demonstrating the effectiveness of this framework.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
Google Scholar
Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
Chapter Google Scholar
Brachmann, E., et al.: DSAC-differentiable RANSAC for camera localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692 (2017)
Google Scholar
Brachmann, E., Rother, C.: Learning less is more-6D camera localization via 3D surface regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662 (2018)
Google Scholar
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Article Google Scholar
DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236 (2018)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Fan, H., Zhou, Y., Li, A., Gao, S., Li, J., Guo, Y.: Visual localization using semantic segmentation and depth prediction. arXiv preprint arXiv:2005.11922 (2020)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Article MathSciNet Google Scholar
Hu, Y., Hugonot, J., Fua, P., Salzmann, M.: Segmentation-driven 6D object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3385–3394 (2019)
Google Scholar
Huang, Z., et al.: VS-Net: voting with segmentation for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6101–6111 (2021)
Google Scholar
Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015)
Google Scholar
Larsson, M., Stenborg, E., Toft, C., Hammarstrand, L., Sattler, T., Kahl, F.: Fine-grained segmentation networks: self-supervised segmentation for improved long-term visual localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 31–41 (2019)
Google Scholar
Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11983–11992 (2020)
Google Scholar
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Google Scholar
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
Article Google Scholar
Mehta, S., Rastegari, M.: Separable self-attention for mobile vision transformers. arXiv preprint arXiv:2206.02680 (2022)
Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136. IEEE (2011)
Google Scholar
Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: pixel-wise voting network for 6DoF pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4561–4570 (2019)
Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)
Google Scholar
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV 2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
Google Scholar
Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: robust hierarchical localization at large scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725 (2019)
Google Scholar
Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2D-to-3D matching. In: 2011 International Conference on Computer Vision, pp. 667–674. IEEE (2011)
Google Scholar
Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision-ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part I 12, vol. 7572, pp. 752–765. Springer, Cham (2012). https://doi.org/10.1007/978-3-642-33718-5_54
Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1744–1756 (2016)
Article Google Scholar
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
Google Scholar
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937 (2013)
Google Scholar
Stenborg, E., Toft, C., Hammarstrand, L.: Long-term visual localization using semantically segmented images. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6484–6490. IEEE (2018)
Google Scholar
Valentin, J., Nießner, M., Shotton, J., Fitzgibbon, A., Izadi, S., Torr, P.H.: Exploiting uncertainty in regression forests for accurate camera relocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4400–4408 (2015)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Wu, C.: VisualSFM: a visual structure from motion system (2011). http://www.cs.washington.edu/homes/ccwu/vsfm
Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
Xu, Y., et al.: SelfVoxeLO: self-supervised lidar odometry with voxel-based deep neural networks. In: Conference on Robot Learning, pp. 115–125. PMLR (2021)
Google Scholar
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yu, X., Zhuang, Z., Koniusz, P., Li, H.: 6DoF object pose estimation via differentiable proxy voting loss. arXiv preprint arXiv:2002.03923 (2020)
Zhang, G., Dong, Z., Jia, J., Wong, T.-T., Bao, H.: Efficient non-consecutive feature tracking for structure-from-motion. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 422–435. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15555-0_31
Chapter Google Scholar
Zhong, Z., et al.: Squeeze-and-attention networks for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13065–13074 (2020)
Google Scholar
Zhou, G., Bescos, B., Dymczyk, M., Pfeiffer, M., Neira, J., Siegwart, R.: Dynamic objects segmentation for visual localization in urban environments. arXiv preprint arXiv:1807.02996 (2018)
Zhu, S., et al.: Parallel structure from motion from local increment to global averaging. arXiv preprint arXiv:1702.08601 (2017)
Liu, J., Nie, Q., Liu, Y., Wang, C.: NeRF-Loc: visual localization with conditional neural radiance field. CoRR abs/2304.07979 (2023)
Google Scholar
Zhou, Q., Agostinho, S., Osep, A., Leal-Taix e, L.: Is geometry enough for matching in visual localization? In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022–17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part X. LNCS, vol. 13670, pp. 407–425. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20080-9_24

Download references

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No. 62302220), in part by the China Postdoctoral Foundation (No. 2023M731691), and in part by the Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2022ZB268).

Author information

Authors and Affiliations

Nanjing University of Science and Technology, Nanjing, 210094, China
Dong Xie, Jianfeng Lu, Zhenbo Song & Xuanzhu Chen

Authors

Dong Xie
View author publications
You can also search for this author in PubMed Google Scholar
Jianfeng Lu
View author publications
You can also search for this author in PubMed Google Scholar
Zhenbo Song
View author publications
You can also search for this author in PubMed Google Scholar
Xuanzhu Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jianfeng Lu .

Editor information

Editors and Affiliations

Dalian University of Technology, Dalian, China
Huchuan Lu
University of Sydney, Sydney, NSW, Australia
Wanli Ouyang
Shenzhen University, Shenzhen, China
Hui Huang
Tsinghua University, Beijing, China
Jiwen Lu
Dalian University of Technology, Dalian, China
Risheng Liu
Institute of Automation, CAS, Beijing, China
Jing Dong
University of Technology Sydney, Sydney, NSW, Australia
Min Xu

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Xie, D., Lu, J., Song, Z., Chen, X. (2023). Incorporating Global Correlation and Local Aggregation for Efficient Visual Localization. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_6

Download citation

DOI: https://doi.org/10.1007/978-3-031-46308-2_6
Published: 30 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46307-5
Online ISBN: 978-3-031-46308-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics