DOI: 10.1145/3607834.3616562
research-article

Modern Backbone for Efficient Geo-localization

Published: 29 October 2023

Abstract

With the development of autonomous driving technology, vision-based geo-localization has attracted steadily growing attention. The key problem is matching the correct image pairs across different perspectives. Existing geo-localization methods focus on designing complex attention mechanisms on top of traditional backbones, e.g., VGG and ResNet, but neglect the importance of the backbone network itself. In this article, we propose a modern backbone based geo-localization method (MBEG). MBEG adopts the recent vision foundation model EVA-02 as its backbone, which has been extensively pre-trained on large datasets. In addition, a feature rotate encoding strategy is presented to eliminate the effects of image rotation. We also apply knowledge distillation to reduce the number of network parameters for practical deployment. Our method achieves excellent performance on the University-1652 dataset, and our solution attained the top-1 ranking in the UAVs in Multimedia Challenge on the University-160k dataset.
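The matching step mentioned in the abstract can be illustrated with a generic retrieval sketch. This is not the MBEG implementation: it only assumes that each image has already been mapped to a feature vector by some backbone, and that a drone-view query is matched to the satellite-view gallery image with the highest cosine similarity. The function names, toy feature vectors, and labels below are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_1(queries, query_labels, gallery, gallery_labels):
    """Fraction of queries whose top-ranked gallery item has the same label."""
    hits = 0
    for query, label in zip(queries, query_labels):
        best = rank_gallery(query, gallery)[0]
        hits += gallery_labels[best] == label
    return hits / len(queries)


# Toy, hand-written features standing in for backbone outputs (hypothetical).
drone_feats = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
drone_labels = ["building_a", "building_b"]
satellite_feats = [[0.1, 0.1, 0.9], [1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]
satellite_labels = ["building_b", "building_a", "building_c"]

score = recall_at_1(drone_feats, drone_labels, satellite_feats, satellite_labels)
print(f"Recall@1 = {score:.2f}")
```

In this toy setting both drone queries retrieve the satellite image with the matching label. In a real pipeline the feature vectors would come from the trained network rather than being written by hand.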


Published In

UAVM '23: Proceedings of the 2023 Workshop on UAVs in Multimedia: Capturing the World from a New Perspective
November 2023
86 pages
ISBN:9798400702860
DOI:10.1145/3607834
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-view matching
  2. geo-localization
  3. knowledge distillation
  4. modern backbone
  5. transformer

Qualifiers

  • Research-article

Conference

MM '23

Article Metrics

  • Downloads (last 12 months): 123
  • Downloads (last 6 weeks): 8
Reflects downloads up to 11 Feb 2025

Cited By

  • (2024) Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models. Proceedings of the 2nd Workshop on UAVs in Multimedia: Capturing the World from a New Perspective, 35-39. DOI: 10.1145/3689095.3689103. Online publication date: 28-Oct-2024
  • (2024) MGAW: An Effective Method for Geo-localization in Adverse Weather. Proceedings of the 2nd Workshop on UAVs in Multimedia: Capturing the World from a New Perspective, 19-23. DOI: 10.1145/3689095.3689101. Online publication date: 28-Oct-2024
  • (2024) Optimizing Geo-Localization with k-Means Re-Ranking in Challenging Weather Conditions. Proceedings of the 2nd Workshop on UAVs in Multimedia: Capturing the World from a New Perspective, 9-13. DOI: 10.1145/3689095.3689099. Online publication date: 28-Oct-2024
  • (2024) Rethinking Pooling for Multi-Granularity Features in Aerial-View Geo-Localization. IEEE Signal Processing Letters, Vol. 31, 3005-3009. DOI: 10.1109/LSP.2024.3484330. Online publication date: 2024
  • (2024) EarthLoc: Astronaut Photography Localization by Indexing Earth from Space. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12754-12764. DOI: 10.1109/CVPR52733.2024.01212. Online publication date: 16-Jun-2024
