Abstract
This paper tackles the problem of Cross-view Video-based camera Localization (CVL). The task is to localize a query camera by leveraging information from its past observations, i.e., a continuous sequence of images observed at previous timestamps, and matching them to a large overhead-view satellite image. The critical challenge of this task is to learn a powerful global feature descriptor for the sequential ground-view images while ensuring its domain alignment with the reference satellite images. For this purpose, we introduce CVLNet, which first projects the sequential ground-view images into an overhead view by exploiting the ground-and-overhead geometric correspondences, and then leverages the photo consistency among the projected images to form a global representation. In this way, the cross-view domain differences are bridged. Since the reference satellite images are usually pre-cropped and regularly sampled, there is always a misalignment between the query camera location and the center of its matching satellite image. Motivated by this, we propose estimating the query camera's relative displacement to a satellite image before similarity matching. In this displacement estimation, we also model the uncertainty of the camera location; for example, a camera is unlikely to be on top of trees. To evaluate the proposed method, we collect satellite images from Google Maps for the KITTI dataset and construct a new cross-view video-based localization benchmark, KITTI-CVL. Extensive experiments demonstrate the effectiveness of video-based localization over single-image-based localization and the superiority of each proposed module over other alternatives.
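The ground-to-overhead projection described above can be illustrated by a simplified flat-ground (planar homography) warp: each cell of an overhead grid is back-projected onto the ground plane and sampled from the ground-view image. This is only a minimal sketch of the general idea, not the paper's actual implementation; the intrinsics, camera height, and grid parameters below are illustrative assumptions.

```python
import numpy as np

def ground_to_overhead(img, K, cam_height=1.65, grid_size=64, meters_per_cell=0.5):
    """Warp a ground-view image onto an overhead grid by assuming every
    pixel lies on a flat ground plane (a common BEV approximation)."""
    H, W, C = img.shape
    out = np.zeros((grid_size, grid_size, C), dtype=img.dtype)
    for i in range(grid_size):          # rows: distance ahead of the camera
        for j in range(grid_size):      # cols: lateral offset
            # Ground-plane point in camera coordinates (x right, y down, z forward)
            z = (grid_size - i) * meters_per_cell       # forward distance in meters
            x = (j - grid_size / 2) * meters_per_cell   # lateral offset in meters
            p = K @ np.array([x, cam_height, z])        # project onto the image plane
            u, v = p[0] / p[2], p[1] / p[2]
            if 0 <= u < W and 0 <= v < H:
                out[i, j] = img[int(v), int(u)]         # nearest-neighbour sample
    return out

# Toy usage with a synthetic 128x256 RGB image and a hypothetical intrinsic matrix
K = np.array([[240.0, 0.0, 128.0],
              [0.0, 240.0, 64.0],
              [0.0, 0.0, 1.0]])
bev = ground_to_overhead(np.random.rand(128, 256, 3), K)
print(bev.shape)  # (64, 64, 3)
```

In CVLNet this projection is applied to learned features rather than raw pixels, and the photo consistency among the warped views of successive frames is then used to aggregate them into one global descriptor.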
Notes
- 1.
For clarity, we use “overhead” throughout the paper to denote the projected features from ground-views, and “satellite” to indicate the real satellite image/features.
References
Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 494–509. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_30
Hu, S., Feng, M., Nguyen, R.M.H., Hee Lee, G.: CVM-Net: cross-view matching network for image-based ground-to-aerial geo-localization. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Liu, L., Li, H.: Lending orientation to neural networks for cross-view geo-localization. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Regmi, K., Shah, M.: Bridging the domain gap for ground-to-aerial image matching. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
Cai, S., Guo, Y., Khan, S., Hu, J., Wen, G.: Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
Shi, Y., Liu, L., Yu, X., Li, H.: Spatial-aware feature aggregation for image based cross-view geo-localization. In: Advances in Neural Information Processing Systems, pp. 10090–10100 (2019)
Shi, Y., Yu, X., Liu, L., Zhang, T., Li, H.: Optimal feature transport for cross-view image geo-localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11990–11997 (2020)
Shi, Y., Yu, X., Campbell, D., Li, H.: Where am I looking at? Joint location and orientation estimation by cross-view matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4064–4072 (2020)
Zhu, S., Yang, T., Chen, C.: Revisiting street-to-aerial view image geo-localization and orientation estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 756–765 (2021)
Toker, A., Zhou, Q., Maximov, M., Leal-Taixé, L.: Coming down to earth: Satellite-to-street view synthesis for geo-localization. In: CVPR (2021)
Zhu, S., Yang, T., Chen, C.: VIGOR: cross-view image geo-localization beyond one-to-one retrieval. In: CVPR (2021)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32, 1231–1237 (2013)
https://developers.google.com/maps/documentation/maps-static/overview
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
Kim, H.J., Dunn, E., Frahm, J.M.: Learned contextual feature reweighting for image geo-localization. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3251–3260. IEEE (2017)
Liu, L., Li, H., Dai, Y.: Stochastic attraction-repulsion embedding for large scale image localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2570–2579 (2019)
Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3456–3465 (2017)
Ge, Y., Wang, H., Zhu, F., Zhao, R., Li, H.: Self-supervising fine-grained region similarities for large-scale image localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 369–386. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_22
Zhou, Y., Wan, G., Hou, S., Yu, L., Wang, G., Rui, X., Song, S.: DA4AD: end-to-end deep attention-based visual localization for autonomous driving. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 271–289. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_17
Castaldo, F., Zamir, A., Angst, R., Palmieri, F., Savarese, S.: Semantic cross-view matching. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 9–17 (2015)
Lin, T.Y., Belongie, S., Hays, J.: Cross-view image geolocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898 (2013)
Mousavian, A., Kosecka, J.: Semantic image based geolocation given a map. arXiv preprint arXiv:1609.00278 (2016)
Tian, Y., Chen, C., Shah, M.: Cross-view image matching for geo-localization in urban environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3616 (2017)
Hu, S., Lee, G.H.: Image-based geo-localization using satellite imagery. Int. J. Comput. Vision 128, 1205–1219 (2020)
Shi, Y., Yu, X., Liu, L., Campbell, D., Koniusz, P., Li, H.: Accurate 3-DOF camera geo-localization via ground-to-satellite image matching. arXiv preprint arXiv:2203.14148 (2022)
Zhu, S., Shah, M., Chen, C.: TransGeo: transformer is all you need for cross-view image geo-localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1162–1171 (2022)
Elhashash, M., Qin, R.: Cross-view SLAM solver: global pose estimation of monocular ground-level video frames for 3D reconstruction using a reference 3D model from satellite images. ISPRS J. Photogramm. Remote. Sens. 188, 62–74 (2022)
Guo, Y., Choi, M., Li, K., Boussaid, F., Bennamoun, M.: Soft exemplar highlighting for cross-view image-based geo-localization. IEEE Trans. Image Process. 31, 2094–2105 (2022)
Zhao, J., Zhai, Q., Huang, R., Cheng, H.: Mutual generative transformer learning for cross-view geo-localization. arXiv preprint arXiv:2203.09135 (2022)
Bloesch, M., Omari, S., Hutter, M., Siegwart, R.: Robust visual inertial odometry using a direct EKF-based approach. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 298–304. IEEE (2015)
Leutenegger, S., Lynen, S., Bosse, M., Siegwart, R., Furgale, P.: Keyframe-based visual-inertial odometry using nonlinear optimization. Int. J. Robot. Res. 34, 314–334 (2015)
Chien, H.J., Chuang, C.C., Chen, C.Y., Klette, R.: When to use what feature? SIFT, SURF, ORB, or A-KAZE features for monocular visual odometry. In: 2016 International Conference on Image and Vision Computing New Zealand (IVCNZ), pp. 1–6 (2016)
Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Rob. 32, 1309–1332 (2016)
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 225–234. IEEE (2007)
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31, 1147–1163 (2015)
Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Rob. 33, 1255–1262 (2017)
Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: ORB-SLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Trans. Robot. 37, 1874–1890 (2021)
Mur-Artal, R., Tardós, J.D.: Visual-inertial monocular SLAM with map reuse. IEEE Robot. Autom. Lett. 2, 796–803 (2017)
Wolcott, R.W., Eustice, R.M.: Visual localization within lidar maps for automated urban driving. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 176–183 (2014)
Voodarla, M., Shrivastava, S., Manglani, S., Vora, A., Agarwal, S., Chakravarty, P.: S-BEV: semantic birds-eye view representation for weather and lighting invariant 3-DOF localization (2021)
Stenborg, E., Toft, C., Hammarstrand, L.: Long-term visual localization using semantically segmented images. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6484–6490. IEEE (2018)
Stenborg, E., Sattler, T., Hammarstrand, L.: Using image sequences for long-term visual localization. In: 2020 International Conference on 3D Vision (3DV), pp. 938–948. IEEE (2020)
Vaca-Castano, G., Zamir, A.R., Shah, M.: City scale geo-spatial trajectory estimation of a moving camera. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1186–1193. IEEE (2012)
Regmi, K., Shah, M.: Video geo-localization employing geo-temporal feature learning and GPS trajectory smoothing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12126–12135 (2021)
Yousif, K., Bab-Hadiashar, A., Hoseinnezhad, R.: An overview to visual odometry and visual slam: applications to mobile robotics. Intell. Ind. Syst. 1, 289–311 (2015)
Scaramuzza, D., Fraundorfer, F.: Visual odometry [tutorial]. IEEE Robot. Autom. Mag. 18, 80–92 (2011)
Gao, X., Wang, R., Demmel, N., Cremers, D.: LDSO: direct sparse odometry with loop closure. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2198–2204. IEEE (2018)
Kasyanov, A., Engelmann, F., Stückler, J., Leibe, B.: Keyframe-based visual-inertial online SLAM with relocalization. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6662–6669. IEEE (2017)
Liu, D., Cui, Y., Guo, X., Ding, W., Yang, B., Chen, Y.: Visual localization for autonomous driving: mapping the accurate location in the city maze (2020)
Hou, Y., Zheng, L., Gould, S.: Multiview Detection with Feature Perspective Transformation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 1–18. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_1
Hou, Y., Zheng, L.: Multiview detection with shadow transformer (and view-coherent data augmentation). In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1673–1682 (2021)
Vora, J., Dutta, S., Jain, K., Karthik, S., Gandhi, V.: Bringing generalization to deep multi-view detection. arXiv preprint arXiv:2109.12227 (2021)
Ma, J., Tong, J., Wang, S., Zhao, W., Zheng, L., Nguyen, C.: Voxelized 3d feature aggregation for multiview detection. arXiv preprint arXiv:2112.03471 (2021)
Zhang, Q., Lin, W., Chan, A.B.: Cross-view cross-scene multi-view crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 557–567 (2021)
Zhang, Q., Chan, A.B.: Wide-area crowd counting via ground-plane density maps and multi-view fusion CNNS. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8297–8306 (2019)
Zhang, Q., Chan, A.B.: 3D crowd counting via multi-view fusion with 3D gaussian kernels. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12837–12844 (2020)
Zhang, Q., Chan, A.B.: Wide-area crowd counting: Multi-view fusion networks for counting in large scenes. Int. J. Comput Vis. 130, 1938–1960 (2022)
Chen, L., et al.: Persformer: 3D lane detection via perspective transformer and the openlane benchmark. arXiv preprint arXiv:2203.11089 (2022)
Shi, Y., Campbell, D.J., Yu, X., Li, H.: Geometry-guided street-view panorama synthesis from satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell. 44, 10009–10022 (2022)
Shi, Y., Li, H.: Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17010–17020 (2022)
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Acknowledgments
This research is funded in part by ARC-Discovery grants (DP190102261 and DP220100800 to HL, DP220100800 to XY) and ARC-DECRA grant (DE230100477 to XY). YS is a China Scholarship Council (CSC)-funded Ph.D. student to ANU. We thank all anonymous reviewers and ACs for their constructive suggestions.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shi, Y., Yu, X., Wang, S., Li, H. (2023). CVLNet: Cross-view Semantic Correspondence Learning for Video-Based Camera Localization. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13841. Springer, Cham. https://doi.org/10.1007/978-3-031-26319-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26318-7
Online ISBN: 978-3-031-26319-4