
1 Introduction

Localization is an essential component of many location-based services (LBS). Traditional outdoor localization methods rely on the Global Positioning System (GPS), but they do not function properly in urban areas with high-rise buildings. Image-based localization methods are regarded as promising alternatives in GPS-denied situations: they are direct and compatible with human understanding, and they can also be used for place recognition when we simply want to find out where a photo was taken.

Fig. 1. Illustration of the image retrieval-based ground-to-aerial geolocalization task. The goal is to find where a ground query image is taken. Normally, there are three steps: (1) matching the query image with geotagged aerial images, (2) ranking the retrieved results by similarity, (3) geotagging the query image with the location of the most similar aerial image.

Image-based geolocalization is normally treated as an image retrieval problem: the predicted location of a query image is the geographical coordinate of the most similar image in a geotagged image database. Image-based geolocalization methods can be categorized into ground-to-ground geolocalization [1, 5, 8, 18] and ground-to-aerial geolocalization [2, 3, 7, 10, 12, 13, 15,16,17]. For ground-to-ground geolocalization, the reference database is composed of ground-level images; this requires a large number of accurately geotagged ground images covering the earth's surface, which are difficult to acquire. For ground-to-aerial geolocalization, the database is made up of overhead images, as illustrated in Fig. 1. This relieves the difficulty of building a large geotagged image database, because aerial images can cover the whole earth's surface and usually come with precise geographical coordinates.

However, ground-to-aerial geolocalization is extremely difficult, since ground-level images (horizontal view) are taken from a very different perspective than overhead images (nadir view). The drastic viewpoint variation results in small overlapping areas between the two types of images, and also leads to problems such as dramatic appearance differences, occlusion, and illumination variation.

Existing cross-view image geolocalization works tackle these issues by matching building facades [2], line segments [12], and handcrafted features [3]. Some works exploit extra information such as land cover maps [13]. With the development of deep learning, powerful deep features have also been utilized [15, 17].

Recently, deep metric learning has also been used to address the problem and shown to be an effective paradigm for cross-view image geolocalization [7, 10, 16]. It exploits the discriminative power of deep neural networks to embed cross-view images into a joint embedding metric space in which simple metrics like the Euclidean distance can be directly used to measure the semantic similarity between them.

In this paper, following the deep metric learning paradigm, we propose a novel spatial-aware Siamese-like network to address the ground-to-aerial geolocalization problem. Compared with previous methods, we exploit the spatial transformer layer (STL) [9] to tackle the large view variation, which helps learn location-discriminative embeddings for this challenging task. Besides, we design a loss that combines the triplet ranking loss with a simple and effective location identity loss to train the proposed network, which further enhances geolocalization performance. We have conducted extensive experiments on a publicly available dataset of cross-view image pairs, and the results show that the proposed method significantly outperforms the state of the art.

The remainder of the paper is organized as follows. We first formulate the ground-to-aerial geolocalization problem in Sect. 2. In Sect. 3, we describe the proposed spatial-aware Siamese-like network as well as the loss function used to train it. In Sect. 4, we present the experiments and analyze the results. Finally, we conclude in Sect. 5.

2 Problem Statement

The goal of ground-to-aerial geolocalization is to find the location \(\ell _g^i\) where a ground query image \(I_g^i\) is taken, given a geotagged overhead image database \(\mathcal {I}_r=\{\left\langle I_r^k, \ell _r^k \right\rangle \}~(k=1,2,...,N)\) as reference:

$$\begin{aligned} \ell _g^i = h(I_g^i, \mathcal {I}_r). \end{aligned}$$
(1)

As illustrated in Fig. 1, the task can be formulated as an image retrieval problem, i.e. finding an aerial image \(I_r^*\) from the reference image database \(\mathcal {I}_r\), which is the most similar to the query image \(I_g^i\). Then the center location \(\ell _r^*\) of \(I_r^*\) would be regarded as the estimated location \(\hat{\ell _g^i}\) of \(I_g^i\):

$$\begin{aligned} \hat{\ell _g^i} = \ell _r^*, ~\mathrm {where}~ I_r^* = \mathop {\hbox {arg min}}\limits _k d(f_g(I_g^i), f_r(I_r^k)), \end{aligned}$$
(2)

where \(f_g\) and \(f_r\) are functions that map the ground and overhead images into a comparable embedding space \(\mathbb {R}^F\) respectively, and \(d(\cdot ,\cdot )\) is a metric distance measuring the dissimilarity of two embedding vectors in the space. Therefore, the key to the problem is matching the ground image to the most similar aerial image.
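As a concrete illustration of Eq. (2), the sketch below performs nearest-neighbor retrieval in the embedding space. The function and variable names are ours, and the embeddings are assumed to be precomputed by \(f_g\) and \(f_r\); it is not the authors' code.

```python
import numpy as np

def geolocalize(query_embedding, ref_embeddings, ref_locations):
    """Assign the query the location of its nearest reference embedding (Eq. 2).

    query_embedding: (F,) embedding f_g(I_g^i) of the ground query image
    ref_embeddings:  (N, F) embeddings f_r(I_r^k) of the aerial reference images
    ref_locations:   (N, 2) geographic coordinates of the reference images
    """
    # squared Euclidean distance d(x, y) = ||x - y||_2^2
    d = np.sum((ref_embeddings - query_embedding) ** 2, axis=1)
    k_star = int(np.argmin(d))       # index of the most similar aerial image I_r^*
    return ref_locations[k_star]     # estimated location of the query image
```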

3 Methodology

In this section, we describe our proposed network that can effectively learn spatial-aware cross-view embedding features for ground-to-aerial image matching, and we also elaborate the loss functions we use to train the network.

3.1 Spatial-Aware Siamese-Like Network for Cross-View Image Matching

Considering that the ground and aerial images are captured from totally different views and their visual contents differ greatly, we propose a network to learn spatial-aware cross-view features that can match them effectively. The architecture of the proposed network is shown in Fig. 2. It is a Siamese-like network, consisting of two sub-networks with the same structure but different parameters, whereas a traditional Siamese network has two identical sub-networks sharing both structure and weights. The goal of the proposed network is to learn two embedding functions \(f(x;\theta _g), f(x;\theta _r): \mathbb {R}^I \rightarrow \mathbb {R}^F\) that map the input ground and overhead images to a joint feature space so that semantically similar ground-aerial image pairs in \(\mathbb {R}^I\) are metrically close in \(\mathbb {R}^F\). The two functions, parameterized by \(\theta _g\) and \(\theta _r\), represent the two sub-networks respectively.

Fig. 2. Overview of the proposed network. (STL: spatial transformer layer, GAP: global average pooling, L2: \(L_2\)-normalization)

Each sub-network is fully convolutional, employing the convolutional part of AlexNet [11] or VGG16 [14] as the basic network for feature extraction. The spatial transformer layer (STL) [9] is appended to the last layer of each sub-network, giving it the capacity to learn spatial transformations automatically, which allows the network to learn the best representation for cross-view matching. The output feature maps are then vectorized by global average pooling (GAP) to obtain fixed-length feature vectors, which are \(L_2\)-normalized to compute the final loss. In the testing phase, the \(L_2\)-normalized embedding vector is used as the representative feature for cross-view image matching.
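To make the pipeline concrete, the following PyTorch sketch shows one branch of the Siamese-like network under our own assumptions about layer names and dimensions; it is not the authors' released code. `SpatialTransformer` refers to the STL sketched later in this section.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SubNetwork(nn.Module):
    """One branch: backbone conv layers -> STL -> GAP -> L2-normalization."""
    def __init__(self, backbone="vgg16"):
        super().__init__()
        if backbone == "vgg16":
            self.features = models.vgg16(weights="IMAGENET1K_V1").features  # conv part only
            channels = 512
        else:  # "alexnet"
            self.features = models.alexnet(weights="IMAGENET1K_V1").features
            channels = 256
        self.stl = SpatialTransformer(channels)  # spatial transformer layer (sketched below)

    def forward(self, x):
        u = self.features(x)                        # convolutional feature maps
        v = self.stl(u)                             # spatially transformed feature maps
        v = F.adaptive_avg_pool2d(v, 1).flatten(1)  # global average pooling
        return F.normalize(v, p=2, dim=1)           # L2-normalized embedding
```

The Siamese-like design simply instantiates two such branches without weight sharing, e.g. `f_g = SubNetwork("vgg16")` for ground images and `f_r = SubNetwork("vgg16")` for aerial images.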

Fig. 3. Overview of the spatial transformer layer (STL).

Spatial Transformer Layer (STL). The STL [9] can warp the input feature map via a specified transformation. In this paper, an affine transformation is exploited, which can alleviate the large view variation between cross-view images through learned translation, rotation, scaling, and skew, as well as cropping. The STL is a learnable, differentiable module that learns a spatial transformation during training and applies it to an input feature map in a single forward pass. Its architecture is shown in Fig. 3. It is composed of three components, i.e. a localization net, a grid generator, and a sampler. The localization net \(f_{loc}\) learns the parameters \(\theta \) of the spatial transformation \(\mathcal {T}_\theta \), \(\theta = f_{loc}(U)\). The grid generator then generates sampling points \(\mathcal {T}_\theta (G)\) on the input feature map U, given the regular grid \(G=\{G_i\}\) of the output feature map V:

$$\begin{aligned} \begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal {T}_\theta (G_i) = M_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \end{aligned}$$
(3)

where \((x_i^s, y_i^s)\) is the source point in the input feature map U, while \((x_i^t, y_i^t)\) is the target point in the output feature map V. \(M_\theta \) is a \(2 \times 3\) affine transformation matrix with 6 parameters. Two fully connected layers with 32 neurons each are used in the localization net to regress the 6 parameters. The sampler generates the final output feature map V by sampling from the input feature map U according to the grid \(\mathcal {T}_\theta (G)\) produced by the grid generator.
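A possible PyTorch realization of the STL is sketched below, using the built-in `affine_grid` and `grid_sample` operators as the grid generator and sampler. The pooling step before the localization net and the exact way the two 32-unit layers feed the 6-parameter regression are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """STL sketch: localization net -> affine grid generator -> sampler."""
    def __init__(self, channels, pooled=4):
        super().__init__()
        # pool to a fixed spatial size so the FC localization net accepts any input size (assumption)
        self.pool = nn.AdaptiveAvgPool2d(pooled)
        self.loc = nn.Sequential(
            nn.Linear(channels * pooled * pooled, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 6),  # the 2x3 affine matrix M_theta
        )
        # initialize to the identity transformation, as described in Sect. 4.2
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, u):
        theta = self.loc(self.pool(u).flatten(1)).view(-1, 2, 3)    # localization net
        grid = F.affine_grid(theta, u.size(), align_corners=False)  # grid generator: T_theta(G)
        return F.grid_sample(u, grid, align_corners=False)          # sampler
```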

3.2 Loss Function

The overall loss function we use to train our network consists of two components, i.e. the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{tri} + \lambda \mathcal {L}_{id}, \end{aligned}$$
(4)

where \(\lambda \) controls the relative importance of the two losses.

Triplet Ranking Loss. The triplet loss characterizes a relative similarity ranking order between image triplets. It has been demonstrated to be effective for cross-view image matching [7, 16]. The goal of the loss is to make an image closer to its paired cross-view image than to any other cross-view image.

Let the metric measuring the similarity of images in the embedding space \(\mathbb {R}^F\) be the squared Euclidean distance \(d(x, y) = \left\| x - y \right\| _2^2\). Then, for a triplet of images \(I_i^a\) (anchor), \(I_j^p\) (positive), and \(I_j^n\) (negative), with corresponding geotags \(l_a, l_p, l_n\) and corresponding embeddings \(\left\langle x_i^a, x_j^p, x_j^n \right\rangle = \left\langle f(I_i^a; \theta _i), f(I_j^p; \theta _j), f(I_j^n; \theta _j)\right\rangle \), the triplet ranking loss can be formulated as follows:

$$\begin{aligned} \mathcal {L}_{tri} = \sum _{i, j} \sum _{\begin{array}{c} a,p,n \\ l_a = l_p \ne l_n \end{array}}{[d(x_i^a,x_j^p) - d(x_i^a,x_j^n) + \alpha ]_{+}}, \end{aligned}$$
(5)

where \([x]_+\) denotes \(\max (x, 0)\) and \(\alpha \) is the margin. \(d(x_i^a,x_j^p)\) and \(d(x_i^a, x_j^n)\) are the distances between the anchor-positive and anchor-negative pairs respectively. i and j indicate the image types, with \(i \ne j\) and \(i,j \in \{g,r\}\), where g stands for a ground image and r for an overhead reference image.

For cross-view geolocalization, there is only one paired cross-view image as the positive sample for each anchor image. In terms of the anchor image type, image triplets can be categorized into the ground-to-aerial type \(\left\langle g,r,r \right\rangle \) and the aerial-to-ground type \(\left\langle r,g,g \right\rangle \). Following previous works [7, 16], we exhaust all valid triplets within a mini-batch to compute the loss during training. There are \(2m(m-1)\) valid triplets within each mini-batch of m cross-view image pairs, with \(m(m-1)\) each for the \(\left\langle g,r,r \right\rangle \) and \(\left\langle r,g,g \right\rangle \) types.
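The batch-exhaustive strategy can be written compactly. The sketch below is our illustration of Eq. (5) over a mini-batch of m embedding pairs, not the authors' implementation.

```python
import torch

def triplet_ranking_loss(x_g, x_r, margin=0.2):
    """All valid in-batch triplets (Eq. 5) for m paired ground/aerial embeddings.

    x_g, x_r: (m, F) embeddings; row k of x_g is paired with row k of x_r.
    """
    d = torch.cdist(x_g, x_r, p=2) ** 2           # (m, m) squared Euclidean distances
    d_pos = d.diag()                              # distances of the m positive pairs
    off_diag = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)

    # <g, r, r>: ground anchor i, positive r_i, negatives r_j (j != i)
    loss_g2r = torch.clamp(d_pos.unsqueeze(1) - d + margin, min=0)[off_diag]
    # <r, g, g>: aerial anchor k, positive g_k, negatives g_j (j != k)
    loss_r2g = torch.clamp(d_pos.unsqueeze(0) - d + margin, min=0)[off_diag]

    return loss_g2r.sum() + loss_r2g.sum()        # 2m(m-1) triplet terms in total
```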

Location Identity Loss. For every cross-view image pair, the ground and overhead images represent the same location. However, they present very different visual contents since they are captured in totally different views. Inspired by the idea of deep feature consistency in facial attribute manipulation [6] and considering the uniqueness of the scene of every spatial location, we introduce a new location identity loss to enforce the feature consistency of cross-view image pairs, which can help preserve the unique identity of each place and learn location discriminative features. It tries to minimize the distance between the embedding features of two paired cross-view images captured at the same location. The formulation of the location identity loss is shown as follows:

$$\begin{aligned} \mathcal {L}_{id} = \sum _k \left\| f(I_g^k; \theta _g) - f(I_r^k; \theta _r) \right\| _2^2, \end{aligned}$$
(6)

where \(I_g^k\) and \(I_r^k\) are the k-th paired ground and overhead images respectively. Intuitively, the learned embedding functions \(f(x;\theta _g)\) and \(f(x;\theta _r)\) should make the cross-view image pairs as close to each other as possible in the embedding space \(\mathbb {R}^F\).
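Continuing the sketch above, the location identity loss and the overall objective of Eq. (4) are straightforward to write down; the hyperparameter values shown follow Sect. 4.2.

```python
def location_identity_loss(x_g, x_r):
    """Location identity loss (Eq. 6): squared L2 distance between paired embeddings."""
    return ((x_g - x_r) ** 2).sum(dim=1).sum()

def total_loss(x_g, x_r, margin=0.2, lam=0.005):
    """Overall training objective of Eq. (4), reusing triplet_ranking_loss above."""
    return triplet_ranking_loss(x_g, x_r, margin) + lam * location_identity_loss(x_g, x_r)
```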

4 Experiments

4.1 Dataset

The CVUSA dataset [19] includes image pairs of panoramic street view images and overhead aerial imagery collected across the US. There are 35,532 image pairs for training, and 8,884 pairs for testing. The size of the ground panoramas is 224 \(\times \) 1232, while the aerial image size is 750 \(\times \) 750.

4.2 Experiment Setup

The street view images are resized to 112 \(\times \) 616 in both the training and testing phases. The aerial images are first resized to 300 \(\times \) 300, then randomly cropped to 256 \(\times \) 256 and rotated by 90n \((n=0,1,2,3)\) degrees in the training phase; in the testing phase they are directly resized to 256 \(\times \) 256.
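The preprocessing described above could be expressed, for example, with torchvision transforms; the exact pipeline details (interpolation, tensor conversion, lack of normalization) are our assumptions.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def random_rot90(img):
    """Rotate an aerial image by 90n degrees, n = 0, 1, 2, 3."""
    return TF.rotate(img, 90 * random.randint(0, 3))

ground_tf = transforms.Compose([           # street view panoramas (train and test)
    transforms.Resize((112, 616)),
    transforms.ToTensor(),
])

aerial_train_tf = transforms.Compose([     # aerial images, training phase
    transforms.Resize((300, 300)),
    transforms.RandomCrop(256),
    transforms.Lambda(random_rot90),
    transforms.ToTensor(),
])

aerial_test_tf = transforms.Compose([      # aerial images, testing phase
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```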

The networks are implemented in the PyTorch framework. The Adam optimizer is used to train them, with a learning rate of 0.00001 and a batch size of 20. The convolutional layers are initialized with the corresponding base network weights pretrained on ImageNet [4], while the STLs are initialized with the identity transformation. Training runs for a maximum of 20 epochs. The weight \(\lambda \) is empirically set to 0.005, and the margin \(\alpha \) of the triplet loss to 0.2.

Evaluation Metric. We adopt the recall accuracy at top 1% as our evaluation metric, the same as previous works [7, 16, 17]. A query is regarded as correctly localized if the paired aerial image of the given ground query image is within the top 1% of the retrieval results.
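A simple sketch of this metric, assuming the query with index i is paired with the reference aerial image of the same index:

```python
import numpy as np

def recall_at_top_one_percent(query_emb, ref_emb):
    """Fraction of queries whose paired aerial image ranks in the top 1% by distance."""
    n_ref = ref_emb.shape[0]
    k = max(1, int(np.ceil(0.01 * n_ref)))        # size of the top-1% shortlist
    hits = 0
    for i, q in enumerate(query_emb):
        d = np.sum((ref_emb - q) ** 2, axis=1)    # squared Euclidean distances
        shortlist = np.argsort(d)[:k]
        hits += int(i in shortlist)               # ground truth shares the query index
    return hits / len(query_emb)
```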

4.3 Results and Analysis

To evaluate the effectiveness of the proposed network, we compare it with baselines and previous methods. There are two major types of baselines: the Siamese network, which has two identical sub-networks with shared weights, and the Siamese-like network, which has two sub-networks of the same structure but different weights. Both have the same structure as the proposed network except for the spatial transformer layers. AlexNet and VGG16 are used as backbone networks. In addition, the proposed network and the baselines are all trained with the proposed loss, which combines the triplet ranking loss and the location identity loss.

Table 1. Top 1% recall accuracy of the proposed network, baselines, and previous state-of-the-art methods.

The top 1% recall results of the proposed network, the baselines, and the previous state-of-the-art methods (as reported in [7]) are presented in Table 1. As can be seen, the VGG16-based Siamese-like network outperforms the shared-weight Siamese network (VGG16-based) by 11.1%, and the proposed network further increases the accuracy by 1.2%, reaching 95.8%, which is 4.4% higher than the previous state-of-the-art result of 91.4% from CVM-Net-I [7], which also uses VGG16 as the backbone network. The results demonstrate the efficacy of our proposed method in improving cross-view image matching accuracy.

Ablation Study. To further validate the effectiveness of the proposed network and the location identity loss, we conduct an ablation study by comparing the top 1% recall of different networks trained with different losses. There are three network architectures: the Siamese network, the Siamese-like network, and the proposed spatial-aware Siamese-like network. Compared to the proposed network, the baseline Siamese-like networks remove the spatial transformer layers, and the Siamese networks further share weights between the two sub-networks. They are trained either with the triplet ranking loss \(\mathcal {L}_{tri}\) only, or with the proposed loss combining the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\). We also compare results across backbone networks, i.e. AlexNet and VGG16.

Table 2. Top 1% recall accuracy of different network architectures trained with different losses. (\(\lambda =0.005\))

The top 1% recall accuracies of the three networks trained under different losses are shown in Table 2. The deeper the network, the better the results: the VGG16-based networks perform significantly better than their AlexNet-based counterparts. The Siamese-like networks outperform the Siamese networks dramatically, which we attribute to removing the shared-weight constraint and thereby increasing the model capacity to learn view-specific features effectively.

The proposed networks with STLs further improve the results noticeably for both base networks, from 65.0% to 71.1% (a 6.1% increase) for AlexNet and from 92.9% to 94.2% (a 1.3% increase) for VGG16, showing the efficacy of STLs in alleviating large view variation by explicitly learning spatial transformations. It is also interesting that the improvement is more pronounced when the backbone network is shallower (6.1% for AlexNet versus 1.3% for VGG16). Furthermore, the proposed location identity loss improves the performance of all the networks (as shown by the increase \(\varDelta \) in Table 2), which demonstrates its effectiveness in regularizing the networks to learn location-discriminative deep features for cross-view image matching.

Qualitative Results. Some retrieval examples from the proposed network with the best performance (95.8%) are presented in Fig. 4. There are four typical scenes: medium residential area, sparse residential area, open cropland, and forest road. For each case, the ground query image is in the leftmost column, and the top five retrieval results are listed to the right, with the ground truth outlined by an orange box.

Fig. 4. Retrieval examples of four typical scenes (from top to bottom: medium residential area, sparse residential area, open cropland, and forest road). For each ground query image, the top 5 retrieved aerial images are presented, with the ground truth outlined by an orange box. (Color figure online)

We can see that, for all four cases, the top-5 retrieved aerial images present very similar patterns and appearances and are clearly drawn from the same scene categories. It is hard even for human eyes to find the correct matches for the ground images. This further demonstrates the effectiveness of the proposed network and its training loss in learning visually discriminative features for cross-view image matching.

However, it should also be noted that it is almost impossible to identify the correct paired aerial image in the forest road case, since the retrieved results all match the ground query image visually and there seem to be no clues to find the true match. This indicates a limitation of the simple image retrieval-based ground-to-aerial geolocalization approach: it is incapable of distinguishing visually similar scenes. The method can therefore serve as a coarse localization approach. To achieve accurate localization in real-world applications, extra supplementary data sources are needed, or a sequence of query images can be exploited to reduce the ambiguity of similar scenes.

5 Concluding Remarks

Image-based ground-to-aerial geolocalization is a promising approach for localization in GPS-denied situations. However, it is very challenging due to the drastic viewpoint difference between the cross-view images. In this paper, we propose a novel spatial-aware Siamese-like network to address the problem, which exploits the spatial transformer layer to explicitly learn spatial transformations between the ground and overhead images and thereby tackle the large view variation. Moreover, we combine the triplet ranking loss with a simple and effective location identity loss to train the proposed network, further enhancing its performance. We evaluate our method on a publicly available dataset of cross-view image pairs, and the results demonstrate that the proposed method achieves state-of-the-art performance. In the future, we plan to utilize extra supplementary data sources and employ image sequences, in conjunction with cross-view image matching, to meet accurate geolocalization needs in real-world applications.