Learning discriminative representations via variational self-distillation for cross-view geo-localization

https://doi.org/10.1016/j.compeleceng.2022.108335

Highlights

  • Variational self-distillation is used for cross-view geo-localization.

  • Square-ring partition strategy is adopted to enrich discriminative clues.

  • Information bottleneck module is designed to discard redundant information.

Abstract

Cross-view geo-localization aims to localize the same geographic target in images captured from different perspectives, e.g., satellite view and drone view. The primary challenge faced by existing methods is the large change in visual appearance across views. Most previous work uses a deep neural network to obtain discriminative representations and directly applies them to the geo-localization task. However, these approaches ignore that the redundancy retained in the extracted features negatively impacts the result. In this paper, we argue that the information bottleneck (IB) can retain the most relevant information while removing as much redundancy as possible. The variational self-distillation (VSD) strategy provides an accurate and analytical solution for estimating the mutual information. To this end, we propose to learn discriminative representations via variational self-distillation (dubbed LDRVSD). Extensive experiments are conducted on two widely used datasets, University-1652 and CVACT, showing the remarkable performance improvements obtained by our LDRVSD method compared with several state-of-the-art approaches.

Introduction

The cross-view geo-localization task aims to match two relevant images taken from different viewpoints, and can be treated as a retrieval task [1], [2], [3], [4], [5]. It has widespread applications [6], [7], [8]. Fig. 1 shows two scenarios of this task. To be more specific, given an image taken from one view (the query), the model should automatically retrieve the image from another view with the same label, whereas images tagged with a different geographic target should not be matched.

Due to the continuous progress of deep learning, cross-view geo-localization has developed rapidly in recent years. In general, the dominant approach is to learn representations that are discriminative and invariant to viewpoint changes, similar to some cross-modal retrieval tasks [9], [10], [11] that use metric learning to constrain the learned features, pulling together features with the same label and pushing apart those with different labels. Attention mechanisms and orientation strategies are also employed to preserve useful information effectively [2], [7], [12]. Since a geographic target is captured from several viewpoints, the network should learn which region is most relevant to the label, either by assigning different attention weights or by using the polar transformation strategy. Inspired by the human visual system, which captures not only the target but also its surroundings, the work in [8] applies a partition strategy to obtain contextual information in an image, which enables larger receptive fields with part alignment. This technique has proven useful, but it neglects that the partition strategy also retains more redundancy. Note that extracting the global feature of an image generates excess information irrelevant to the given task, which severely harms the generalization and robustness of the network. Finally, we find that most previous cross-view geo-localization work neither eliminates redundant information from the information bottleneck perspective to improve retrieval performance, nor considers the impact of redundant information.
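A partition strategy of the kind described above (the square-ring partition named in the highlights) could be sketched as follows. This is a minimal numpy sketch under our own assumptions about ring widths and pooling; the actual partition used in [8] or in our method may differ in its details.

```python
import numpy as np

def square_ring_partition(fmap, num_rings=4):
    """Split a square H x W x C feature map into concentric square
    rings and average-pool each ring into one C-dim part feature."""
    h, w, c = fmap.shape
    assert h == w, "square-ring partition assumes a square feature map"
    # ring depth of each cell = distance to the nearest map border
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    depth = np.minimum.reduce([ys, xs, h - 1 - ys, w - 1 - xs])
    step = (h // 2 + h % 2) / num_rings      # band width per ring
    ring = np.minimum((depth / step).astype(int), num_rings - 1)
    parts = []
    for r in range(num_rings):
        mask = ring == (num_rings - 1 - r)   # innermost ring first
        parts.append(fmap[mask].mean(axis=0))
    return np.stack(parts)                   # (num_rings, C)
```

Each ring then yields one part-level descriptor, so the target region (inner rings) and its surroundings (outer rings) contribute separate, aligned features.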

To address this problem, we introduce the information bottleneck principle to the cross-view geo-localization task. The goal of this theory is to obtain sufficient representations with minimal superfluous information; it yields compact, view-invariant representations from the input observation. However, mutual information optimization has its own problems: the result depends heavily on the mutual information estimator, and it is difficult to trade off high compression against accurate prediction [13]. To solve these issues, we apply the variational self-distillation strategy [14], which is designed to fit the mutual information without explicitly estimating it: the KL-divergence measures the discrepancy between two predicted distributions. Based on these works, we propose a novel IB-based framework termed LDRVSD, shown in Fig. 2. We believe that integrating the information bottleneck improves the network's robustness and thus its performance. We demonstrate the effectiveness of LDRVSD on the newly proposed benchmark University-1652 [6], achieving considerable performance gains over the baseline method. We also conduct experiments on CVACT for ground-to-satellite image matching and obtain remarkable improvements.
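The KL-based distillation term described above could be sketched as follows. This is an illustrative numpy sketch, not the formulation of [14]: we simply assume the "two predicted distributions" are class posteriors produced from the full observation feature and from the compressed bottleneck code, and align them with a KL-divergence.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) between probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def vsd_loss(logits_v, logits_z):
    """Self-distillation term: make the prediction from the
    compressed code z match that of the full observation v."""
    p_v = softmax(logits_v)   # teacher: observation-level prediction
    p_z = softmax(logits_z)   # student: bottleneck-level prediction
    return kl_divergence(p_v, p_z).mean()
```

The loss is zero when the two predictive distributions coincide, so minimizing it pushes the compressed code to keep exactly the task-relevant information carried by the full feature, without any explicit mutual information estimator.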

The main contributions of this work are summarized as follows. (1) We propose a novel and effective network that learns discriminative representations via variational self-distillation (LDRVSD). Different from existing works, LDRVSD adopts the information bottleneck theory to obtain a lower-dimensional representation of the input observation while maintaining its consistency. The dimension of the final output feature is 384, smaller than the 512 dimensions used previously, so we improve retrieval performance and also increase retrieval speed. (2) We investigate and demonstrate the effectiveness of our proposed model on two frequently used datasets, i.e., University-1652 for drone-to-satellite image matching and CVACT for ground-to-satellite image matching. Both experiments achieve excellent performance.

The organization of this paper is as follows. In the second section, we discuss the development of the cross-view geo-localization task and recent methods based on information bottleneck theory, analyze the problems in previous cross-view geo-localization work, and explain how the advantages of information bottleneck theory address them. In the third section, we describe the proposed LDRVSD method in detail. In the fourth section, we compare our results with other advanced methods and present a series of ablation experiments to validate our ideas. The fifth section concludes the paper.

Section snippets

Cross-view Geo-localization

With an increasing number of potential applications, the cross-view geo-localization task has received increasing attention. The main challenge is to learn robust and discriminative image representations despite the large appearance gap between different views. Some researchers treat it as an image retrieval task [1], [2] and try to learn view-invariant representations to bridge the gap between images from different platforms. The emergence of CNN has brought great progress to the feature

Proposed method

As visualized in Fig. 2, the framework of our approach is composed of two main components: a feature extraction module, followed by an information bottleneck module. These two modules are described in detail in this section.
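The second module could be sketched as follows. This is a minimal numpy sketch under the assumption of a VIB-style Gaussian encoder head [13]; the 512-to-384 projection only mirrors the output dimension stated in the contributions, and the weights and prior are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ib_head(feat, w_mu, w_logvar):
    """Map a backbone feature to a compact stochastic code via
    mean/log-variance heads and the reparameterization trick."""
    mu = feat @ w_mu
    logvar = feat @ w_logvar
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # KL to a standard normal prior penalizes retained information
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return z, kl.mean()

feat = rng.standard_normal((4, 512))            # backbone output
w_mu = rng.standard_normal((512, 384)) * 0.01   # illustrative sizes
w_logvar = np.zeros((512, 384))
z, kl = ib_head(feat, w_mu, w_logvar)           # z: (4, 384) code
```

The KL term compresses the code toward the prior while the task loss (here omitted) keeps it predictive, which is the trade-off the information bottleneck formalizes.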

Datasets

University-1652 is a newly proposed multi-view dataset that adds a new drone view, which has fewer obstacles than the ground view. The dataset contains 1652 buildings from 72 universities around the world; the training set includes 701 buildings from 33 universities, and the test set includes 951 buildings from the other 39 universities. The task on it is drone-to-satellite image matching. CVACT is a large cross-view dataset containing 35,532 pairs of ground and satellite

Conclusion

Existing cross-view geo-localization methods aim to find invariant features across different viewpoints, but ignore the superfluous information contained in the learned representations. Inspired by information bottleneck theory, we propose a novel and efficient framework named LDRVSD, which discards information that is not useful for the given task and is optimized by the variational self-distillation strategy. Also, the dimension of the output representation of each part is reduced to 384, which raises

CRediT authorship contribution statement

Qian Hu: Methodology, Software, Data curation, Writing – original draft. Wansi Li: Conceptualization, Methodology, Writing – original draft. Xing Xu: Resource acquisition, Supervision, Writing – review & editing. Ning Liu: Validation, Formal analysis, Supervision, Writing – review & editing. Lei Wang: Resource acquisition, Supervision, Writing – review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2022.108335.

Qian Hu is currently pursuing a master’s degree with the Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. His research interests include deep learning and computer vision.

References (28)

  • Lin T. et al. Learning deep representations for ground-to-aerial geolocalization.
  • Liu L, Li H. Lending Orientation to Neural Networks for Cross-View Geo-Localization. In: IEEE conference on computer...
  • Xu X. et al. Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Trans Cybern (2020).
  • Wang B. et al. Adversarial cross-modal retrieval.
  • Xu X. et al. Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans Image Process (2017).
  • Zheng Z. et al. University-1652: A multi-view multi-source benchmark for drone-based geo-localization.
  • Shi Y, Liu L, Yu X, Li H. Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization. In: Advances...
  • Wang T. et al. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans Circuits Syst Video Technol (2022).
  • Xu X. et al. Learning cross-modal common representations by private-shared subspaces separation. IEEE Trans Cybern (2020).
  • Xu X. et al. Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited. IEEE Trans Pattern Anal Mach Intell (2020).
  • Xu X. et al. Cross-modal attention with semantic consistence for image-text matching. IEEE Trans Neural Netw Learn Syst (2020).
  • Fu Y. et al. STA: Spatial-temporal attention for large-scale video-based person re-identification.
  • Alemi A.A. et al. Deep variational information bottleneck.
  • Tian X, Zhang Z, Lin S, Qu Y, Xie Y, Ma L. Farewell to Mutual Information: Variational Distillation for Cross-Modal...

    Wansi Li is currently pursuing a master’s degree with the Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. Her research interests include deep learning and cross-modal retrieval.

    Xing Xu is currently with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. His current research interests mainly focus on multimedia information retrieval, pattern recognition and computer vision.

    Ning Liu is currently a lecturer in School of Information Science and Technology, Beijing Forestry University, China. His main research interests include computer vision, natural language processing and big data analysis.

    Lei Wang is currently pursuing a doctor's degree with the School of Computer Science, Singapore Management University, Singapore. His research interests include computer vision and natural language processing.

    This paper is for regular issues of CAEE. Reviews processed and recommended for publication to the Editor-in-Chief by Hamed Vahdat-Nejad.
