Learning discriminative representations via variational self-distillation for cross-view geo-localization☆
Introduction
Cross-view geo-localization aims to match images of the same geographic target captured from different viewpoints, and it can be treated as a retrieval task [1], [2], [3], [4], [5]. It is widely used in practice [6], [7], [8]. Fig. 1 shows two scenarios of this task. More specifically, given a query image taken from one view, the model should retrieve the images from another view that share the same label, whereas images tagged with a different geographic target should not be matched.
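The retrieval formulation above can be illustrated with a minimal sketch. It assumes that each image has already been mapped to a feature vector and that cosine similarity is the ranking score; the function names, the toy features, and the labels are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, query_labels, gallery, gallery_labels, k=1):
    """Fraction of queries whose top-k results contain a same-label image."""
    hits = 0
    for q, label in zip(queries, query_labels):
        top_k = rank_gallery(q, gallery)[:k]
        if any(gallery_labels[i] == label for i in top_k):
            hits += 1
    return hits / len(queries)

# Toy 2-D features (hypothetical values, not from any dataset).
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # e.g. satellite-view features
gallery_labels = ["A", "B", "C"]
queries = [[0.9, 0.1], [0.1, 0.9]]               # e.g. drone-view features
query_labels = ["A", "B"]
```

In this toy setting a query is counted as correct when a gallery image with the same label appears among its top-k results, which is the usual way such retrieval tasks are scored.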
Driven by the continuous progress of deep learning, cross-view geo-localization has developed rapidly in recent years. In general, the dominant approach is to learn representations that are discriminative and invariant to viewpoint changes, similar to some cross-modal retrieval tasks [9], [10], [11] that use metric learning to constrain the learned features, pulling features with the same label together and pushing apart those with different labels. In addition, attention mechanisms and orientation strategies have been used to preserve useful information effectively [2], [7], [12]. Since the same geographic target is captured from several viewpoints, the network should learn which regions are most relevant to the label, either by assigning different attention weights or by using the polar transformation strategy. Inspired by the human visual system, which captures not only the target but also its surroundings, the work in [8] applies a partition strategy to obtain contextual information in an image, which enables larger receptive fields with part alignment. This technique has proven useful, but it overlooks the fact that the partition strategy retains more redundancy. Extracting the global feature of an image also produces excess information that is irrelevant to the given task, which harms the generalization and robustness of the network. Finally, we find that most previous cross-view geo-localization work neither eliminated redundant information from the perspective of the information bottleneck to improve retrieval performance nor considered the impact of that redundancy.
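As an illustration of the metric-learning idea mentioned above (same-label features pulled together, different-label features pushed apart), the following is a minimal sketch of a triplet loss. The triplet form, the Euclidean distance, and the margin value of 0.3 are assumptions chosen for illustration; the cited works may use other objectives.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed hyperparameter); otherwise it
    grows with the violation.
    """
    d_pos = euclidean(anchor, positive)   # same-label pair: should be small
    d_neg = euclidean(anchor, negative)   # different-label pair: should be large
    return max(0.0, d_pos - d_neg + margin)

# Well-separated triplet: the constraint is satisfied, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Violating triplet: the positive is farther away than the negative.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing such a loss over many triplets moves same-label features closer and different-label features farther apart, which is the behavior the paragraph describes.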
To address this problem, we introduce the information bottleneck principle to the cross-view geo-localization task. The goal of this principle is to obtain representations that are sufficient for the task while containing minimal superfluous information, yielding lower-dimensional and view-invariant representations of the input observation. However, optimizing mutual information raises some problems. For example, the result depends heavily on the mutual information estimator, and it is difficult to achieve a trade-off between high compression and accurate prediction [13]. To solve them, we apply the variational self-distillation strategy [14], which is designed to fit the mutual information without explicitly estimating it; instead, it measures the discrepancy between two predicted distributions. Based on these works, we propose a novel IB-based framework termed LDRVSD, which is shown in Fig. 2. We believe that integrating the information bottleneck can improve the robustness of the network and thus its performance. We demonstrate the effectiveness of LDRVSD on the recently proposed benchmark University-1652 [6] and achieve a considerable performance gain over the baseline method. We also conduct experiments on CVACT for ground-to-satellite image matching and obtain remarkable improvements.
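The idea of measuring the discrepancy between two predicted distributions, rather than estimating mutual information directly, can be sketched as follows. This is an illustrative assumption and not the authors' implementation: it takes the discrepancy to be the KL divergence between a class distribution predicted from the observation feature and one predicted from the compressed representation, and the function names and the direction of the divergence are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(logits_observation, logits_representation):
    """Discrepancy between predictions made from the observation feature
    and from the bottleneck representation (assumed form: KL divergence)."""
    p_obs = softmax(logits_observation)
    p_rep = softmax(logits_representation)
    return kl_divergence(p_obs, p_rep)

# Identical predictions give zero discrepancy; differing ones give a positive value.
same = self_distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = self_distillation_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

Under this assumed form, driving the loss toward zero encourages the compressed representation to predict the label as well as the original observation does, without requiring a mutual information estimator.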
The main contributions of this work can be summarized as follows: (1) We propose a novel and effective network named learning discriminative representations via variational self-distillation (LDRVSD). Different from existing works, LDRVSD adopts the information bottleneck theory to obtain a lower-dimensional representation of the input observation while maintaining its consistency. The dimension of the final feature output is 384, which is smaller than the previous 512. This improves retrieval performance and also raises retrieval speed. (2) We investigate and demonstrate the effectiveness of our proposed model on two frequently used datasets, i.e., University-1652 for drone-to-satellite image matching and CVACT for ground-to-satellite image matching. Both experiments achieve excellent performance.
The rest of this paper is organized as follows. The second section discusses the development of the cross-view geo-localization task and some recent methods based on information bottleneck theory. We analyze the problems in previous cross-view geo-localization work and use the advantages of information bottleneck theory to address them. The third section describes the proposed LDRVSD method in detail. The fourth section compares our results with other advanced methods and presents a series of ablation experiments to validate our ideas. The fifth section concludes the paper.
Cross-view Geo-localization
With an increasing number of potential applications, the cross-view geo-localization task has received increased attention. The main challenge of the task is to learn robust and discriminative image representations despite the large appearance gap between the different views. Some researchers treat it as an image retrieval task [1], [2] and try to learn view-invariant representations to bridge the gap between images from different platforms. The emergence of CNN has brought great progress to the feature
Proposed method
As visualized in Fig. 2, the framework of our approach is composed of two main components: a feature extraction module, followed by an information bottleneck module. These two modules are described in detail in this section.
Datasets
University-1652 is a recently proposed multi-view dataset that adds a drone view, which has fewer obstacles than the ground view. The dataset contains 1652 buildings from 72 universities around the world: the training set includes 701 buildings from 33 universities, and the test set includes 951 buildings from 39 other universities. The task on it is drone-to-satellite image matching. CVACT is a large cross-view dataset containing 35,532 pairs of ground and satellite
Conclusion
Existing cross-view geo-localization methods aim to find features that are invariant across viewpoints, but ignore the superfluous information contained in the learned representations. Inspired by the information bottleneck theory, we propose a novel and efficient framework named LDRVSD, which is optimized by the variational self-distillation strategy and can discard information that is not useful for a given task. Also, the dimension of the output representation of each part is reduced to 384, which raises
CRediT authorship contribution statement
Qian Hu: Methodology, Software, Data curation, Writing – original draft. Wansi Li: Conceptualization, Methodology, Writing – original draft. Xing Xu: Resource acquisition, Supervision, Writing – review & editing. Ning Liu: Validation, Formal analysis, Supervision, Writing – review & editing. Lei Wang: Resource acquisition, Supervision, Writing – review & editing.
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2022.108335.
References (28)
- et al. Learning deep representations for ground-to-aerial geolocalization.
- Liu L, Li H. Lending Orientation to Neural Networks for Cross-View Geo-Localization. In: IEEE conference on computer...
- et al. Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Trans Cybern (2020).
- et al. Adversarial cross-modal retrieval.
- et al. Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans Image Process (2017).
- et al. University-1652: A multi-view multi-source benchmark for drone-based geo-localization.
- Shi Y, Liu L, Yu X, Li H. Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization. In: Advances...
- et al. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans Circuits Syst Video Technol (2022).
- et al. Learning cross-modal common representations by private-shared subspaces separation. IEEE Trans Cybern (2020).
- et al. Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited. IEEE Trans Pattern Anal Mach Intell (2020).
- Cross-modal attention with semantic consistence for image-text matching. IEEE Trans Neural Netw Learn Syst.
- STA: Spatial-temporal attention for large-scale video-based person re-identification.
- Deep variational information bottleneck.
Qian Hu is currently pursuing a master’s degree with the Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. His research interests include deep learning and computer vision.
Wansi Li is currently pursuing a master’s degree with the Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. Her research interests include deep learning and cross-modal retrieval.
Xing Xu is currently with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. His current research interests mainly focus on multimedia information retrieval, pattern recognition and computer vision.
Ning Liu is currently a lecturer in School of Information Science and Technology, Beijing Forestry University, China. His main research interests include computer vision, natural language processing and big data analysis.
Lei Wang is currently pursuing a doctoral degree with the School of Computer Science, Singapore Management University, Singapore. His research interests include computer vision and natural language processing.
☆ This paper is for regular issues of CAEE. Reviews processed and recommended for publication to the Editor-in-Chief by Hamed Vahdat-Nejad.