Abstract
Image-based geolocalization is an important alternative to GPS-based localization in GPS-denied situations. Among image-based approaches, ground-to-aerial geolocalization is particularly promising but also difficult because of the drastic viewpoint and appearance differences between ground and aerial images. In this paper, we propose a novel spatial-aware Siamese-like network that addresses the issue by exploiting the spatial transformer layer to alleviate the large view variation and learn location-discriminative embeddings from the cross-view images. Furthermore, we propose to combine the triplet ranking loss with a simple and effective location identity loss to further enhance performance. We test our method on a publicly available dataset, and the results show that the proposed method outperforms the state of the art by a large margin.
1 Introduction
Localization is an essential component of many location-based services (LBS). Traditional outdoor localization methods rely on the global positioning system (GPS). However, GPS does not function properly in urban areas with high-rise buildings. Image-based localization methods are regarded as promising alternatives in GPS-denied situations. They are direct and compatible with human understanding. Besides, they can also be used for place recognition when we simply want to find out where a photo was taken.
Image-based geolocalization is normally treated as an image retrieval problem: the predicted location of a query image is set to the geographical coordinates of the most similar image in a geotagged image database. Image-based geolocalization methods can be categorized into ground-to-ground geolocalization [1, 5, 8, 18] and ground-to-aerial geolocalization [2, 3, 7, 10, 12, 13, 15, 16, 17]. For ground-to-ground geolocalization, the reference image database is composed of ground-level images. This approach requires a large number of accurately geotagged ground images to cover the earth's surface, which are difficult to acquire. For ground-to-aerial geolocalization, in contrast, the image database is made up of overhead images, as illustrated in Fig. 1. This relieves the difficulty of building a large geotagged image database, because aerial images can cover the whole of the earth's surface and usually come with precise geographical coordinates.
However, ground-to-aerial geolocalization is extremely difficult, since ground-level images (horizontal view) are taken from a very different perspective than overhead images (nadir view). The drastic viewpoint variation results in small overlap areas between the two types of images, and also leads to problems such as dramatic appearance differences, occlusion, and illumination variation.
Existing cross-view image geolocalization works tackle these issues by matching building facades [2], line segments [12], and handcrafted features [3]. Some works exploit extra information such as land cover maps [13]. With the development of deep learning, powerful deep features have also been utilized [15, 17].
Recently, deep metric learning has also been used to address the problem and shown to be an effective paradigm for cross-view image geolocalization [7, 10, 16]. It exploits the discriminative power of deep neural networks to embed cross-view images into a joint embedding metric space in which simple metrics like the Euclidean distance can be directly used to measure the semantic similarity between them.
In this paper, following the deep metric learning paradigm, we propose a novel spatial-aware Siamese-like network to address the ground-to-aerial geolocalization problem. In contrast to previous methods, we exploit the spatial transformer layer (STL) [9] to tackle the large view variation problem, which helps to learn location-discriminative embeddings for this challenging task. Besides, we design a loss that combines the triplet ranking loss with a simple and effective location identity loss to train the proposed network, which further enhances the geolocalization performance. We have conducted extensive experiments on a publicly available dataset of cross-view image pairs to test our method, and the results show that the proposed method significantly outperforms the state of the art.
The remainder of the paper is organized as follows. We first formulate the ground-to-aerial geolocalization problem in Sect. 2. In Sect. 3, we describe the proposed spatial-aware Siamese-like network as well as the loss function we use to train it. In Sect. 4, we describe the experiments and analyze the results. Finally, we conclude in Sect. 5.
2 Problem Statement
The goal of ground-to-aerial geolocalization is to find the location \(\ell _g^i\) where a ground query image \(I_g^i\) is taken, given a geotagged overhead image database \(\mathcal {I}_r=\{\left\langle I_r^k, \ell _r^k \right\rangle \}~(k=1,2,...,N)\) as reference:
As illustrated in Fig. 1, the task can be formulated as an image retrieval problem, i.e. finding the aerial image \(I_r^*\) in the reference image database \(\mathcal {I}_r\) that is most similar to the query image \(I_g^i\). The center location \(\ell _r^*\) of \(I_r^*\) is then regarded as the estimated location \(\hat{\ell _g^i}\) of \(I_g^i\):
\[\hat{\ell _g^i} = \ell _r^*, \quad I_r^* = \mathop {\arg \min }_{I_r^k \in \mathcal {I}_r} d\left( f_g(I_g^i), f_r(I_r^k) \right) \]
where \(f_g\) and \(f_r\) are functions that map the ground and overhead images into a comparable embedding space \(\mathbb {R}^F\) respectively, and \(d(\cdot ,\cdot )\) is a metric distance measuring the dissimilarity of two embedding vectors in the space. Therefore, the key to the problem is matching the ground image to the most similar aerial image.
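The retrieval formulation above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: it assumes the embeddings \(f_g(I_g^i)\) and \(f_r(I_r^k)\) have already been computed, takes \(d\) to be the squared Euclidean distance, and uses hypothetical names (`squared_euclidean`, `localize`) and made-up coordinates.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Location = Tuple[float, float]  # (latitude, longitude); hypothetical geotag format


def squared_euclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def localize(query_embedding: Vector,
             reference_db: List[Tuple[Vector, Location]]) -> Location:
    """Return the geotag of the reference embedding closest to the query.

    reference_db holds (embedding, location) pairs, one per overhead image.
    """
    _, best_location = min(
        reference_db,
        key=lambda entry: squared_euclidean(query_embedding, entry[0]),
    )
    return best_location


# Toy example with made-up 3-D embeddings and coordinates.
reference_db = [
    ([1.0, 0.0, 0.0], (40.0, -100.0)),
    ([0.0, 1.0, 0.0], (41.0, -101.0)),
    ([0.0, 0.0, 1.0], (42.0, -102.0)),
]
query = [0.1, 0.9, 0.0]
estimated_location = localize(query, reference_db)
```

Here the query is closest to the second reference embedding, so the estimate is that image's geotag. A linear scan is used for clarity; a large reference database would likely call for an indexed nearest-neighbour search instead.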
3 Methodology
In this section, we describe the proposed network, which can effectively learn spatial-aware cross-view embedding features for ground-to-aerial image matching, and the loss functions we use to train it.
3.1 Spatial-Aware Siamese-Like Network for Cross-View Image Matching
Since the ground and aerial images are captured from totally different views and their visual contents differ greatly, we propose a network that learns spatial-aware cross-view features to match them effectively. The architecture of the proposed network is shown in Fig. 2. It is a Siamese-like network consisting of two sub-networks with the same structure but different parameters, whereas a traditional Siamese network has two identical sub-networks that share both structure and weights. The goal of the proposed network is to learn two embedding functions \(f(x;\theta _g), f(x;\theta _r): \mathbb {R}^I \rightarrow \mathbb {R}^F\) that map the input ground and overhead images to a joint feature space so that semantically similar ground-aerial image pairs in \(\mathbb {R}^I\) are metrically close in \(\mathbb {R}^F\). The two functions, parameterized by \(\theta _g\) and \(\theta _r\), represent the two sub-networks respectively.
Each sub-network is fully convolutional, employing the convolutional part of AlexNet [11] or VGG16 [14] as the base network for feature extraction. A spatial transformer layer (STL) [9] is appended to the last layer of each sub-network, giving it the capacity to learn spatial transformations automatically, which allows the network to learn the best representation for cross-view matching. The output feature maps are then vectorized by global average pooling (GAP) to obtain fixed-length feature vectors, which are \(L_2\)-normalized to compute the final loss. In the testing phase, the \(L_2\)-normalized embedding vector is used as the representative feature for cross-view image matching.
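To illustrate the last two steps of each sub-network, the following sketch applies global average pooling and \(L_2\) normalization to a toy feature map in plain Python. It is not the authors' code: the nested-list layout `[channel][row][column]` and the function names are assumptions made for this example.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Average each channel over its spatial positions: one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def l2_normalize(vector: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (eps guards against a zero norm)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


# Toy feature map: 2 channels, each of spatial size 2 x 2.
feature_map = [
    [[1.0, 2.0], [3.0, 6.0]],  # channel mean 3.0
    [[4.0, 4.0], [4.0, 4.0]],  # channel mean 4.0
]
embedding = l2_normalize(global_average_pool(feature_map))
```

Because pooling collapses the spatial dimensions, the embedding length depends only on the number of channels. This is what lets ground and aerial inputs of different sizes (112 \(\times \) 616 and 256 \(\times \) 256 in the experiments) share one embedding space.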
Spatial Transformer Layer (STL). The STL [9] warps the input feature map via a specified transformation. In this paper, the affine transformation is used, which can alleviate the large view variation between cross-view images by learning translation, rotation, scale, and skew transformations, as well as cropping. The STL is a learnable, differentiable module that learns a spatial transformation during training and is applied to an input feature map in a single forward pass. Its architecture is shown in Fig. 3. It is composed of three components, i.e. a localization net, a grid generator, and a sampler. The localization net \(f_{loc}\) learns the parameters \(\theta \) of the spatial transformation \(\mathcal {T}_\theta \), \(\theta = f_{loc}(U)\). The grid generator is then used to generate sampling points \(\mathcal {T}_\theta (G)\) from the input feature map U, given the regular grid \(G=\{G_i\}\) of the output feature map V:
\[\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal {T}_\theta (G_i) = M_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \]
where \((x_i^s, y_i^s)\) is the source point in the input feature map U, while \((x_i^t, y_i^t)\) is the target point in the output feature map V. \(M_\theta \) is a \(2 \times 3\) affine transformation matrix with 6 parameters. Two fully connected layers with 32 neurons each are used for the localization net to regress the 6 parameters. The sampler generates the final output feature map V by sampling from the input feature map U according to the generated grid \(\mathcal {T}_\theta (G)\) from the grid generator.
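The grid generator and sampler can be sketched as follows for a single-channel feature map. This is an illustration under stated assumptions, not the layer used in the paper: coordinates are normalized to \([-1, 1]\) and sampling is bilinear, as in the spatial transformer paper [9]; the localization net is omitted and `theta` is simply passed in; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Grid2D = List[List[float]]
SamplingGrid = List[List[Tuple[float, float]]]


def affine_grid(theta: List[List[float]], height: int, width: int) -> SamplingGrid:
    """Grid generator: (x_s, y_s) = M_theta [x_t, y_t, 1]^T for every output pixel."""
    grid = []
    for row in range(height):
        y_t = -1.0 + 2.0 * row / (height - 1) if height > 1 else 0.0
        grid_row = []
        for col in range(width):
            x_t = -1.0 + 2.0 * col / (width - 1) if width > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            grid_row.append((x_s, y_s))
        grid.append(grid_row)
    return grid


def bilinear_sample(feature: Grid2D, grid: SamplingGrid) -> Grid2D:
    """Sampler: read U at the source points; points outside U contribute 0."""
    in_h, in_w = len(feature), len(feature[0])

    def pixel(r: int, c: int) -> float:
        return feature[r][c] if 0 <= r < in_h and 0 <= c < in_w else 0.0

    output = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Map normalized coordinates back to (fractional) pixel indices.
            col = (x_s + 1.0) * (in_w - 1) / 2.0
            row = (y_s + 1.0) * (in_h - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(row)
            dc, dr = col - c0, row - r0
            out_row.append(
                pixel(r0, c0) * (1 - dr) * (1 - dc)
                + pixel(r0, c0 + 1) * (1 - dr) * dc
                + pixel(r0 + 1, c0) * dr * (1 - dc)
                + pixel(r0 + 1, c0 + 1) * dr * dc
            )
        output.append(out_row)
    return output


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Identity transformation: the output reproduces the input.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
identity_output = bilinear_sample(feature, affine_grid(identity, 3, 3))

# Negating the x scale mirrors the feature map horizontally.
flip = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
flipped_output = bilinear_sample(feature, affine_grid(flip, 3, 3))
```

With the identity matrix the layer is a no-op, which is why initializing an STL to the identity transformation leaves the pretrained features untouched at the start of training.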
3.2 Loss Function
The overall loss function we use to train our network consists of two components, i.e. the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\):
\[\mathcal {L} = \mathcal {L}_{tri} + \lambda \mathcal {L}_{id}\]
where \(\lambda \) controls the relative importance of the two losses.
Triplet Ranking Loss. The triplet loss characterizes a relative similarity ranking order between image triplets. It has been demonstrated to be effective for cross-view image matching [7, 16]. The goal of the loss is to make an image closer to its paired cross-view image than to any other cross-view image.
Let the metric that measures the similarity of images in the embedding space \(\mathbb {R}^F\) be the squared Euclidean distance \(d(x, y) = \left\| x - y \right\| _2^2\). Consider a triplet of images \(I_i^a\) (anchor), \(I_j^p\) (positive), and \(I_j^n\) (negative), with corresponding geotags \(l_a, l_p, l_n\) and embeddings \(\left\langle x_i^a, x_j^p, x_j^n \right\rangle \) = \(\left\langle f(I_i^a; \theta _i), f(I_j^p; \theta _j), f(I_j^n; \theta _j)\right\rangle \). The triplet ranking loss is then formulated as follows:
\[\mathcal {L}_{tri} = \left[ d(x_i^a, x_j^p) - d(x_i^a, x_j^n) + \alpha \right] _+\]
where \([x]_+\) represents \(\max (x, 0)\) and \(\alpha \) denotes the margin. \(d(x_i^a,x_j^p)\) and \(d(x_i^a, x_j^n)\) are the distances between the anchor-positive and anchor-negative pairs respectively. The indices i and j indicate the image type, with \(i \ne j\) and \(i,j \in \{g,r\}\), where g stands for ground image and r for overhead reference image.
For cross-view geolocalization, there is only one paired cross-view image as the positive sample for each anchor image. Depending on the anchor image type, image triplets can be categorized into the ground-to-aerial type \(\left\langle g,r,r \right\rangle \) and the aerial-to-ground type \(\left\langle r,g,g \right\rangle \). Following previous works [7, 16], we exhaust all the valid triplets within a mini-batch to compute the loss during training. Each mini-batch of m cross-view image pairs contains \(2m(m-1)\) valid triplets, with \(m(m-1)\) each for the \(\left\langle g,r,r \right\rangle \) and \(\left\langle r,g,g \right\rangle \) types.
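The exhaustive in-batch triplet construction can be sketched as below. The sketch assumes the standard hinge form \([d(a,p) - d(a,n) + \alpha ]_+\) with the squared Euclidean distance and a margin of 0.2; averaging the per-triplet losses is an assumption of this example, as the aggregation is not specified here, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_euclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance d(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float) -> float:
    """Hinge loss [d(a, p) - d(a, n) + margin]_+ for one triplet."""
    return max(
        squared_euclidean(anchor, positive)
        - squared_euclidean(anchor, negative)
        + margin,
        0.0,
    )


def batch_triplet_losses(ground: List[Vector], aerial: List[Vector],
                         margin: float = 0.2) -> List[float]:
    """Losses of all valid triplets in a mini-batch; ground[k] pairs with aerial[k]."""
    m = len(ground)
    losses = []
    for k in range(m):
        for j in range(m):
            if j == k:
                continue
            # <g, r, r>: ground anchor, paired aerial positive, other aerial negative.
            losses.append(triplet_loss(ground[k], aerial[k], aerial[j], margin))
            # <r, g, g>: aerial anchor, paired ground positive, other ground negative.
            losses.append(triplet_loss(aerial[k], ground[k], ground[j], margin))
    return losses


# Toy batch of m = 2 pairs in which each image is closer to the wrong partner.
ground = [[1.0, 0.0], [0.0, 1.0]]
aerial = [[0.6, 0.8], [0.8, 0.6]]
losses = batch_triplet_losses(ground, aerial)
mean_loss = sum(losses) / len(losses)  # averaging is an assumption of this sketch
```

For this batch there are \(2m(m-1) = 4\) triplets, and every one of them violates the margin, so each contributes a positive loss.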
Location Identity Loss. For every cross-view image pair, the ground and overhead images represent the same location. However, they present very different visual contents since they are captured in totally different views. Inspired by the idea of deep feature consistency in facial attribute manipulation [6] and considering the uniqueness of the scene of every spatial location, we introduce a new location identity loss to enforce the feature consistency of cross-view image pairs, which can help preserve the unique identity of each place and learn location discriminative features. It tries to minimize the distance between the embedding features of two paired cross-view images captured at the same location. The formulation of the location identity loss is shown as follows:
where \(I_g^k\) and \(I_r^k\) are the k-th paired ground and overhead images respectively. Intuitively, the learned embedding functions \(f(x;\theta _g)\) and \(f(x;\theta _r)\) should make the cross-view image pairs as close to each other as possible in the embedding space \(\mathbb {R}^F\).
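A possible reading of the location identity loss, and of how it combines with the triplet term, is sketched below. Several details are assumptions of this example rather than statements from the paper: the distance is taken to be the squared Euclidean distance, the per-pair distances are averaged over the mini-batch, and \(\lambda \) is assumed to scale the identity term. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_euclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two embeddings (assumed distance)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def location_identity_loss(ground: List[Vector], aerial: List[Vector]) -> float:
    """Mean distance between paired embeddings; ground[k] pairs with aerial[k].

    Averaging over the batch is an assumption of this sketch.
    """
    distances = [squared_euclidean(g, r) for g, r in zip(ground, aerial)]
    return sum(distances) / len(distances)


def total_loss(triplet_term: float, identity_term: float,
               lam: float = 0.005) -> float:
    """Combined loss, assuming lambda weights the identity term."""
    return triplet_term + lam * identity_term


# Toy batch: the first pair is far apart, the second pair coincides.
ground = [[1.0, 0.0], [0.0, 1.0]]
aerial = [[0.0, 1.0], [0.0, 1.0]]
identity_term = location_identity_loss(ground, aerial)
combined = total_loss(triplet_term=0.3, identity_term=identity_term)
```

Unlike the triplet term, which only constrains relative distances, this term pulls each pair together in absolute terms; with a small weight such as 0.005 it acts as a mild regularizer on top of the ranking objective.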
4 Experiments
4.1 Dataset
The CVUSA dataset [19] includes image pairs of panoramic street view images and overhead aerial imagery collected across the US. There are 35,532 image pairs for training and 8,884 pairs for testing. The size of the ground panoramas is 224 \(\times \) 1232, while the aerial image size is 750 \(\times \) 750.
4.2 Experiment Setup
The street view images are resized to 112 \(\times \) 616 in both the training and testing phases. In the training phase, the aerial images are first resized to 300 \(\times \) 300, then randomly cropped to 256 \(\times \) 256 and rotated by 90n \((n=0,1,2,3)\) degrees; in the testing phase, they are directly resized to 256 \(\times \) 256.
The networks are implemented in PyTorch and trained with the Adam optimizer, using a learning rate of 0.00001 and a batch size of 20. The convolutional layers of the networks are initialized with the weights of the corresponding base network pretrained on ImageNet [4], while the STLs are initialized to the identity transformation. Training runs for at most 20 epochs. The weight \(\lambda \) is empirically set to 0.005. The margin \(\alpha \) of the triplet loss is empirically set to 0.2.
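The crop-and-rotate part of the aerial augmentation can be illustrated on a small 2-D grid. This sketch skips the resizing step, treats the image as a list of rows, and assumes counter-clockwise rotation and uniform sampling of the crop offset and rotation count; none of these details are specified in the text, and the function names are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def random_crop(image: Image, size: int, rng: random.Random) -> Image:
    """Cut a size x size window at a uniformly random position."""
    top = rng.randint(0, len(image) - size)       # randint is inclusive on both ends
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def rotate90(image: Image, n: int) -> Image:
    """Rotate counter-clockwise by 90 * n degrees (direction is an assumption)."""
    for _ in range(n % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def augment_aerial(image: Image, crop_size: int, rng: random.Random) -> Image:
    """Random crop followed by a random rotation with n drawn from {0, 1, 2, 3}."""
    return rotate90(random_crop(image, crop_size, rng), rng.randint(0, 3))


# Toy 4 x 4 "image" standing in for a resized aerial image.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
augmented = augment_aerial(toy_image, crop_size=3, rng=random.Random(0))
```

Rotating by multiples of 90 degrees keeps the image square and avoids interpolation, and it reflects the fact that an overhead view has no canonical orientation relative to the ground panorama.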
Evaluation Metric. We adopt the recall accuracy at top 1% as our evaluation metric, the same as previous works [7, 16, 17]. A query is regarded as correct if the corresponding aerial image of the given ground query image is within the top 1% retrieval results.
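The metric can be computed as in the following sketch, which assumes that `ground[k]` and `aerial[k]` are the embeddings of the k-th test pair and that similarity is measured by the squared Euclidean distance. Rounding the cut-off up to a whole number of images is an assumption; with 8,884 test pairs, top 1% corresponds to roughly 88 or 89 images depending on the rounding convention.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_euclidean(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def recall_at_top_percent(ground: List[Vector], aerial: List[Vector],
                          percent: float = 1.0) -> float:
    """Fraction of ground queries whose paired aerial image ranks in the top percent."""
    top_k = max(1, math.ceil(len(aerial) * percent / 100.0))  # rounding up is an assumption
    hits = 0
    for k, query in enumerate(ground):
        d_true = squared_euclidean(query, aerial[k])
        # Rank of the true match = number of references strictly closer to the query.
        closer = sum(1 for ref in aerial if squared_euclidean(query, ref) < d_true)
        if closer < top_k:
            hits += 1
    return hits / len(ground)


# Toy test set of 4 pairs; the last query sits far from its true match.
ground = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
aerial = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
recall_top_1_of_4 = recall_at_top_percent(ground, aerial, percent=25.0)
recall_top_3_of_4 = recall_at_top_percent(ground, aerial, percent=75.0)
```

In the toy set, the true match of the last query is only the third closest reference, so it counts as a miss when the cut-off is one image and as a hit when the cut-off is three.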
4.3 Results and Analysis
To evaluate the effectiveness of the proposed network, we compare it with baselines and previous methods. There are two major types of baselines: the Siamese network, which has two identical sub-networks with shared weights, and the Siamese-like network, which has two sub-networks of the same structure but different weights. Both the Siamese and Siamese-like networks have the same structure as the proposed network, except for the spatial transformer layers. AlexNet and VGG16 are used as backbone networks. In addition, the proposed network and the baselines are all trained with the proposed loss, which combines the triplet ranking loss and the location identity loss.
The top 1% recall results of the proposed network and the baselines, together with the previous state-of-the-art results (reported in [7]), are presented in Table 1. As can be seen, the VGG16-based Siamese-like network outperforms the VGG16-based shared-weight Siamese network by 11.1 percentage points, and the proposed network further increases the accuracy by 1.2 points, reaching 95.8%. This is 4.4 points higher than the previous state-of-the-art result of 91.4% from CVM-Net-I [7], which also uses VGG16 as the backbone network. The results demonstrate the efficacy of our proposed method in improving cross-view image matching accuracy.
Ablation Study. To further validate the effectiveness of our proposed network and the location identity loss, we conduct an ablation study by comparing the top 1% recall results of different networks trained with different losses. There are three network architectures, i.e. the Siamese network, the Siamese-like network, and our proposed spatial-aware Siamese-like network. Compared to the proposed network, the baseline Siamese-like networks remove the spatial transformer layers, and the Siamese networks further share weights between the two sub-networks. They are trained with either the triplet ranking loss \(\mathcal {L}_{tri}\) only, or the proposed loss, which combines the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\). We also compare the results of using different backbone networks, i.e. AlexNet and VGG16.
The top 1% recall accuracies of the three networks trained with different losses are shown in Table 2. The deeper backbone gives better results: the VGG16-based networks perform significantly better than their AlexNet-based counterparts. Siamese-like networks outperform Siamese networks dramatically, which we attribute to the removal of the shared-weight constraint: this increases the model capacity and allows the sub-networks to learn view-specific features effectively.
The proposed networks with STLs further improve the results noticeably with both base networks, from 65.0% to 71.1% (a 6.1-point increase) for AlexNet and from 92.9% to 94.2% (a 1.3-point increase) for VGG16, showing the efficacy of STLs in alleviating large view variation by explicitly learning spatial transformations. Interestingly, the improvement is larger for the shallower backbone network. Furthermore, the proposed location identity loss improves the performance of all the networks (as shown by the increase \(\varDelta \) in Table 2), which demonstrates its effectiveness in regularizing the networks to learn location-discriminative deep features for cross-view image matching.
Qualitative Results. Some retrieval examples of the proposed network with the best performance (95.8%) are presented in Fig. 4. There are four typical scenes, i.e. medium residential area, sparse residential area, open cropland, and forest road. For each case, the ground query image is in the leftmost column, and the top five retrieval results are listed on the right, with the ground truth outlined by an orange box.
We can see that, in all four cases, the top-5 retrieved aerial images present very similar patterns and appearances, and are clearly drawn from the same scene category as the query. It is hard even for human eyes to find the correct matches for the ground images. This further demonstrates the effectiveness of our proposed network and the corresponding training loss in learning visually discriminative features for cross-view image matching.
However, it should also be noted that it is almost impossible to identify the correct paired aerial image in the forest road case, since the retrieved results all match the ground query image well visually, and there seem to be no clues for finding the real match. This indicates a limitation of the simple image retrieval-based ground-to-aerial geolocalization approach: it is incapable of distinguishing visually similar scenes. The method can therefore be used as a coarse localization approach. To achieve accurate localization in real-world applications, supplementary data sources are needed, or query image sequences can be exploited to reduce the ambiguity of similar scenes.
5 Concluding Remarks
Image-based ground-to-aerial geolocalization is a promising approach for localization in GPS-denied situations. However, it is very challenging because of the drastic viewpoint difference between the cross-view images. In this paper, we propose a novel spatial-aware Siamese-like network to address the problem, which exploits the spatial transformer layer to explicitly learn spatial transformations between the ground and overhead images and thereby tackle the large view variation. Moreover, we propose to combine the triplet ranking loss with a simple and effective location identity loss to train the proposed network and further enhance performance. We evaluate our method on a publicly available dataset of cross-view image pairs, and the results demonstrate that the proposed method achieves state-of-the-art performance. In the future, we plan to utilize supplementary data sources and employ image sequences, in conjunction with cross-view image matching, to meet the need for accurate geolocalization in real-world applications.
References
1. Arandjelovic, R., Gronát, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016, pp. 5297–5307 (2016)
2. Bansal, M., Sawhney, H.S., Cheng, H., Daniilidis, K.: Geo-localization of street views with aerial image databases. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 1125–1128. ACM, New York (2011)
3. Chu, H., Mei, H., Bansal, M., Walter, M.R.: Accurate Vision-based Vehicle Localization using Satellite Imagery. arXiv:1510.09171 [cs] (2015)
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 248–255 (2009)
5. Hays, J., Efros, A.A.: IM2GPS: estimating geographic information from a single image. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24–26 June 2008, Anchorage, Alaska, USA (2008)
6. Hou, X., Shen, L., Sun, K., Qiu, G.: Deep feature consistent variational autoencoder. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1133–1141 (2017)
7. Hu, S., Feng, M., Nguyen, R.M.H., Lee, G.H.: CVM-Net: cross-view matching network for image-based ground-to-aerial geo-localization. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018, pp. 7258–7267 (2018)
8. Iscen, A., Tolias, G., Avrithis, Y.S., Furon, T., Chum, O.: Panorama to Panorama matching for location recognition. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR 2017, 6–9 June 2017, Bucharest, Romania, pp. 392–396 (2017)
9. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 7–12 December 2015, Montreal, Quebec, Canada, pp. 2017–2025 (2015)
10. Kim, D.K., Walter, M.R.: Satellite image-based localization via learned embeddings. In: 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, 29 May–3 June 2017, pp. 2073–2080 (2017)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, 3–6 December 2012, Lake Tahoe, Nevada, US, pp. 1106–1114 (2012)
12. Li, A., Morariu, V.I., Davis, L.S.: Planar structure matching under projective uncertainty for geolocation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 265–280. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_18
13. Lin, T.Y., Belongie, S.J., Hays, J.: Cross-view image geolocalization. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013, pp. 891–898 (2013)
14. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs] (2014)
15. Tian, Y., Chen, C., Shah, M.: Cross-view image matching for geo-localization in urban environments. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 21–26 July 2017, Honolulu, HI, USA, pp. 1998–2006 (2017)
16. Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 494–509. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_30
17. Workman, S., Souvenir, R., Jacobs, N.: Wide-area image geolocalization with aerial reference imagery. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, 7–13 December 2015, Santiago, Chile, pp. 3961–3969 (2015)
18. Zamir, A.R., Shah, M.: Accurate image localization based on Google maps street view. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 255–268. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_19
19. Zhai, M., Bessinger, Z., Workman, S., Jacobs, N.: Predicting ground-level scene layout from aerial imagery. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 21–26 July 2017, Honolulu, HI, USA, pp. 4132–4140 (2017)
Acknowledgments
The authors acknowledge the financial support from the International Doctoral Innovation Centre, Ningbo Education Bureau, Ningbo Science and Technology Bureau, and the University of Nottingham. This work was supported in part by the UK Engineering and Physical Sciences Research Council [grant number EP/L015463/1], the National Natural Science Foundation of China (No. 41871329), the Shenzhen Future Industry Development Funding Program (No. 201607281039561400), and the Shenzhen Scientific Research and Development Funding Program (No. JCYJ20170818092931604).
© 2019 Springer Nature Switzerland AG
Cite this paper
Cao, R. et al. (2019). Learning Spatial-Aware Cross-View Embeddings for Ground-to-Aerial Geolocalization. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science(), vol 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_5
DOI: https://doi.org/10.1007/978-3-030-34120-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34119-0
Online ISBN: 978-3-030-34120-6
eBook Packages: Computer Science, Computer Science (R0)