
1 Introduction

Localization is an essential component of many location-based services (LBS). Traditional outdoor localization methods rely on the Global Positioning System (GPS), but they do not function properly in urban areas with high-rise buildings. Image-based localization methods are regarded as promising alternatives in GPS-denied situations: they are direct and compatible with human understanding, and they can also be used for place recognition when we simply want to find out where a photo was taken.

Fig. 1. Illustration of the image retrieval-based ground-to-aerial geolocalization task. The goal is to find where a ground query image is taken. Normally, there are three steps: (1) matching the query image with geotagged aerial images, (2) ranking the retrieved results by similarity, (3) geotagging the query image with the location of the most similar aerial image.

Image-based geolocalization is normally treated as an image retrieval problem: the predicted location of a query image is the geographical coordinate of the most similar image in a geotagged image database. Image-based geolocalization methods can be categorized into ground-to-ground geolocalization [1, 5, 8, 18] and ground-to-aerial geolocalization [2, 3, 7, 10, 12, 13, 15,16,17]. For ground-to-ground geolocalization, the reference database is composed of ground-level images; this requires a large number of accurately geotagged ground images covering the earth's surface, which are difficult to acquire. For ground-to-aerial geolocalization, the database is made up of overhead images, as illustrated in Fig. 1. This relieves the difficulty of building a large geotagged image database, because aerial images can cover the whole earth's surface and usually come with precise geographical coordinates.

However, ground-to-aerial geolocalization is extremely difficult, since ground-level images (horizontal view) are taken from a very different perspective than overhead images (nadir view). The drastic viewpoint variation results in small overlapping areas between the two types of images, and also leads to problems such as dramatic appearance differences, occlusion, and illumination variation.

Existing cross-view image geolocalization works tackle these issues by matching building facades [2], line segments [12], and handcrafted features [3]. Some works exploit extra information such as land cover maps [13]. With the development of deep learning, powerful deep features have also been utilized [15, 17].

Recently, deep metric learning has also been used to address the problem and shown to be an effective paradigm for cross-view image geolocalization [7, 10, 16]. It exploits the discriminative power of deep neural networks to embed cross-view images into a joint embedding metric space in which simple metrics like the Euclidean distance can be directly used to measure the semantic similarity between them.

In this paper, following the deep metric learning paradigm, we propose a novel spatial-aware Siamese-like network to address the ground-to-aerial geolocalization problem. Compared with previous methods, we exploit the spatial transformer layer (STL) [9] to tackle the large view variation, which helps learn location-discriminative embeddings for this challenging task. Besides, we design a loss that combines the triplet ranking loss with a simple and effective location identity loss to train the proposed network, which further enhances geolocalization performance. We have conducted extensive experiments on a publicly available dataset of cross-view image pairs, and the results show that the proposed method significantly outperforms the state of the art.

The remainder of the paper is organized as follows. We first formulate the ground-to-aerial geolocalization problem in Sect. 2. In Sect. 3, we describe the proposed spatial-aware Siamese-like network as well as the loss function used to train it. In Sect. 4, we present the experiments and analyze the results. Finally, we conclude in Sect. 5.

2 Problem Statement

The goal of ground-to-aerial geolocalization is to find the location \(\ell _g^i\) where a ground query image \(I_g^i\) is taken, given a geotagged overhead image database \(\mathcal {I}_r=\{\left\langle I_r^k, \ell _r^k \right\rangle \}~(k=1,2,...,N)\) as reference:

$$\begin{aligned} \ell _g^i = h(I_g^i, \mathcal {I}_r). \end{aligned}$$
(1)

As illustrated in Fig. 1, the task can be formulated as an image retrieval problem, i.e. finding an aerial image \(I_r^*\) from the reference image database \(\mathcal {I}_r\), which is the most similar to the query image \(I_g^i\). Then the center location \(\ell _r^*\) of \(I_r^*\) would be regarded as the estimated location \(\hat{\ell _g^i}\) of \(I_g^i\):

$$\begin{aligned} \hat{\ell _g^i} = \ell _r^*, ~\mathrm {where}~ I_r^* = \mathop {\hbox {arg min}}\limits _k d(f_g(I_g^i), f_r(I_r^k)), \end{aligned}$$
(2)

where \(f_g\) and \(f_r\) are functions that map the ground and overhead images into a comparable embedding space \(\mathbb {R}^F\) respectively, and \(d(\cdot ,\cdot )\) is a metric distance measuring the dissimilarity of two embedding vectors in the space. Therefore, the key to the problem is matching the ground image to the most similar aerial image.
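As a concrete illustration of Eq. (2), the sketch below performs nearest-neighbor retrieval in the embedding space. The function and variable names are ours, and the embeddings are assumed to be precomputed by \(f_g\) and \(f_r\); it is not the authors' code.

```python
import numpy as np

def geolocalize(query_embedding, ref_embeddings, ref_locations):
    """Assign the query the location of its nearest reference embedding (Eq. 2).

    query_embedding: (F,) embedding f_g(I_g^i) of the ground query image
    ref_embeddings:  (N, F) embeddings f_r(I_r^k) of the aerial reference images
    ref_locations:   (N, 2) geographic coordinates of the reference images
    """
    # squared Euclidean distance d(x, y) = ||x - y||_2^2
    d = np.sum((ref_embeddings - query_embedding) ** 2, axis=1)
    k_star = int(np.argmin(d))       # index of the most similar aerial image I_r^*
    return ref_locations[k_star]     # estimated location of the query image
```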

3 Methodology

In this section, we describe our proposed network that can effectively learn spatial-aware cross-view embedding features for ground-to-aerial image matching, and we also elaborate the loss functions we use to train the network.

3.1 Spatial-Aware Siamese-Like Network for Cross-View Image Matching

Considering that the ground and aerial images are captured from totally different views and their visual contents differ greatly, we propose a network to learn spatial-aware cross-view features that can match them effectively. The architecture of the proposed network is shown in Fig. 2. It is a Siamese-like network, consisting of two sub-networks with the same structure but different parameters, whereas a traditional Siamese network has two identical sub-networks sharing both structure and weights. The goal of the proposed network is to learn two embedding functions \(f(x;\theta _g), f(x;\theta _r): \mathbb {R}^I \rightarrow \mathbb {R}^F\) that map the input ground and overhead images to a joint feature space so that semantically similar ground-aerial image pairs in \(\mathbb {R}^I\) are metrically close in \(\mathbb {R}^F\). The two functions, parameterized by \(\theta _g\) and \(\theta _r\), represent the two sub-networks respectively.

Fig. 2. Overview of the proposed network. (STL: spatial transformer layer, GAP: global average pooling, L2: \(L_2\)-normalization)

Each sub-network is fully convolutional, employing the convolutional part of AlexNet [11] or VGG16 [14] as the basic network for feature extraction. The spatial transformer layer (STL) [9] is appended to the last layer of each sub-network, giving it the capacity to learn spatial transformations automatically, which allows the network to learn the best representation for cross-view matching. The output feature maps are then vectorized by global average pooling (GAP) to obtain fixed-length feature vectors, which are \(L_2\)-normalized to compute the final loss. In the testing phase, the \(L_2\)-normalized embedding vector is used as the representative feature for cross-view image matching.
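To make the pipeline concrete, the following PyTorch sketch shows one branch of the Siamese-like network under our own assumptions about layer names and dimensions; it is not the authors' released code. `SpatialTransformer` refers to the STL sketched later in this section.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SubNetwork(nn.Module):
    """One branch: backbone conv layers -> STL -> GAP -> L2-normalization."""
    def __init__(self, backbone="vgg16"):
        super().__init__()
        if backbone == "vgg16":
            self.features = models.vgg16(weights="IMAGENET1K_V1").features  # conv part only
            channels = 512
        else:  # "alexnet"
            self.features = models.alexnet(weights="IMAGENET1K_V1").features
            channels = 256
        self.stl = SpatialTransformer(channels)  # spatial transformer layer (sketched below)

    def forward(self, x):
        u = self.features(x)                        # convolutional feature maps
        v = self.stl(u)                             # spatially transformed feature maps
        v = F.adaptive_avg_pool2d(v, 1).flatten(1)  # global average pooling
        return F.normalize(v, p=2, dim=1)           # L2-normalized embedding
```

The Siamese-like design simply instantiates two such branches without weight sharing, e.g. `f_g = SubNetwork("vgg16")` for ground images and `f_r = SubNetwork("vgg16")` for aerial images.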

Fig. 3. Overview of the spatial transformer layer (STL).

Spatial Transformer Layer (STL). The STL [9] can warp the input feature map via a specified transformation. In this paper, an affine transformation is exploited, which can alleviate the large view variation between cross-view images through learned translation, rotation, scaling, and skew, as well as cropping. The STL is a learnable, differentiable module that learns a spatial transformation during training and applies it to an input feature map in a single forward pass. Its architecture is shown in Fig. 3. It is composed of three components, i.e. a localization net, a grid generator, and a sampler. The localization net \(f_{loc}\) learns the parameters \(\theta \) of the spatial transformation \(\mathcal {T}_\theta \), \(\theta = f_{loc}(U)\). The grid generator then generates sampling points \(\mathcal {T}_\theta (G)\) on the input feature map U, given the regular grid \(G=\{G_i\}\) of the output feature map V:

$$\begin{aligned} \begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal {T}_\theta (G_i) = M_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \end{aligned}$$
(3)

where \((x_i^s, y_i^s)\) is the source point in the input feature map U, while \((x_i^t, y_i^t)\) is the target point in the output feature map V. \(M_\theta \) is a \(2 \times 3\) affine transformation matrix with 6 parameters. Two fully connected layers with 32 neurons each are used in the localization net to regress the 6 parameters. The sampler generates the final output feature map V by sampling from the input feature map U according to the grid \(\mathcal {T}_\theta (G)\) produced by the grid generator.
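A possible PyTorch realization of the STL is sketched below, using the built-in `affine_grid` and `grid_sample` operators as the grid generator and sampler. The pooling step before the localization net and the exact way the two 32-unit layers feed the 6-parameter regression are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """STL sketch: localization net -> affine grid generator -> sampler."""
    def __init__(self, channels, pooled=4):
        super().__init__()
        # pool to a fixed spatial size so the FC localization net accepts any input size (assumption)
        self.pool = nn.AdaptiveAvgPool2d(pooled)
        self.loc = nn.Sequential(
            nn.Linear(channels * pooled * pooled, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 6),  # the 2x3 affine matrix M_theta
        )
        # initialize to the identity transformation, as described in Sect. 4.2
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, u):
        theta = self.loc(self.pool(u).flatten(1)).view(-1, 2, 3)    # localization net
        grid = F.affine_grid(theta, u.size(), align_corners=False)  # grid generator: T_theta(G)
        return F.grid_sample(u, grid, align_corners=False)          # sampler
```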

3.2 Loss Function

The overall loss function we use to train our network consists of two components, i.e. the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{tri} + \lambda \mathcal {L}_{id}, \end{aligned}$$
(4)

where \(\lambda \) controls the relative importance of the two losses.

Triplet Ranking Loss. The triplet loss characterizes a relative similarity ranking order between image triplets. It has been demonstrated to be effective for cross-view image matching [7, 16]. The goal of the loss is to make an image closer to its paired cross-view image than to any other cross-view image.

Let the metric measuring the similarity of images in the embedding space \(\mathbb {R}^F\) be the squared Euclidean distance \(d(x, y) = \left\| x - y \right\| _2^2\). Then, for a triplet of images \(I_i^a\) (anchor), \(I_j^p\) (positive), and \(I_j^n\) (negative), with corresponding geotags \(l_a, l_p, l_n\) and corresponding embeddings \(\left\langle x_i^a, x_j^p, x_j^n \right\rangle = \left\langle f(I_i^a; \theta _i), f(I_j^p; \theta _j), f(I_j^n; \theta _j)\right\rangle \), the triplet ranking loss can be formulated as follows:

$$\begin{aligned} \mathcal {L}_{tri} = \sum _{i, j} \sum _{\begin{array}{c} a,p,n \\ l_a = l_p \ne l_n \end{array}}{[d(x_i^a,x_j^p) - d(x_i^a,x_j^n) + \alpha ]_{+}}, \end{aligned}$$
(5)

where \([x]_+\) denotes \(\max (x, 0)\) and \(\alpha \) is the margin. \(d(x_i^a,x_j^p)\) and \(d(x_i^a, x_j^n)\) are the distances between the anchor-positive and anchor-negative pairs respectively. i and j indicate the image types, with \(i \ne j\) and \(i,j \in \{g,r\}\), where g stands for a ground image and r for an overhead reference image.

For cross-view geolocalization, there is only one paired cross-view image as the positive sample for each anchor image. In terms of the anchor image type, image triplets can be categorized into the ground-to-aerial type \(\left\langle g,r,r \right\rangle \) and the aerial-to-ground type \(\left\langle r,g,g \right\rangle \). Following previous works [7, 16], we exhaust all valid triplets within a mini-batch to compute the loss during training. There are \(2m(m-1)\) valid triplets within each mini-batch of m cross-view image pairs, with \(m(m-1)\) each for the \(\left\langle g,r,r \right\rangle \) and \(\left\langle r,g,g \right\rangle \) types.
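The batch-exhaustive strategy can be written compactly. The sketch below is our illustration of Eq. (5) over a mini-batch of m embedding pairs, not the authors' implementation.

```python
import torch

def triplet_ranking_loss(x_g, x_r, margin=0.2):
    """All valid in-batch triplets (Eq. 5) for m paired ground/aerial embeddings.

    x_g, x_r: (m, F) embeddings; row k of x_g is paired with row k of x_r.
    """
    d = torch.cdist(x_g, x_r, p=2) ** 2           # (m, m) squared Euclidean distances
    d_pos = d.diag()                              # distances of the m positive pairs
    off_diag = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)

    # <g, r, r>: ground anchor i, positive r_i, negatives r_j (j != i)
    loss_g2r = torch.clamp(d_pos.unsqueeze(1) - d + margin, min=0)[off_diag]
    # <r, g, g>: aerial anchor k, positive g_k, negatives g_j (j != k)
    loss_r2g = torch.clamp(d_pos.unsqueeze(0) - d + margin, min=0)[off_diag]

    return loss_g2r.sum() + loss_r2g.sum()        # 2m(m-1) triplet terms in total
```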

Location Identity Loss. For every cross-view image pair, the ground and overhead images represent the same location. However, they present very different visual contents since they are captured in totally different views. Inspired by the idea of deep feature consistency in facial attribute manipulation [6] and considering the uniqueness of the scene of every spatial location, we introduce a new location identity loss to enforce the feature consistency of cross-view image pairs, which can help preserve the unique identity of each place and learn location discriminative features. It tries to minimize the distance between the embedding features of two paired cross-view images captured at the same location. The formulation of the location identity loss is shown as follows:

$$\begin{aligned} \mathcal {L}_{id} = \sum _k \left\| f(I_g^k; \theta _g) - f(I_r^k; \theta _r) \right\| _2^2, \end{aligned}$$
(6)

where \(I_g^k\) and \(I_r^k\) are the k-th paired ground and overhead images respectively. Intuitively, the learned embedding functions \(f(x;\theta _g)\) and \(f(x;\theta _r)\) should make the cross-view image pairs as close to each other as possible in the embedding space \(\mathbb {R}^F\).
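Continuing the sketch above, the location identity loss and the overall objective of Eq. (4) are straightforward to write down; the hyperparameter values shown follow Sect. 4.2.

```python
def location_identity_loss(x_g, x_r):
    """Location identity loss (Eq. 6): squared L2 distance between paired embeddings."""
    return ((x_g - x_r) ** 2).sum(dim=1).sum()

def total_loss(x_g, x_r, margin=0.2, lam=0.005):
    """Overall training objective of Eq. (4), reusing triplet_ranking_loss above."""
    return triplet_ranking_loss(x_g, x_r, margin) + lam * location_identity_loss(x_g, x_r)
```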

4 Experiments

4.1 Dataset

The CVUSA dataset [19] includes image pairs of panoramic street view images and overhead aerial imagery collected across the US. There are 35,532 image pairs for training, and 8,884 pairs for testing. The size of the ground panoramas is 224 \(\times \) 1232, while the aerial image size is 750 \(\times \) 750.

4.2 Experiment Setup

The street view images are resized to 112 \(\times \) 616 in both the training and testing phases. The aerial images are first resized to 300 \(\times \) 300, then randomly cropped to 256 \(\times \) 256 and rotated by 90n \((n=0,1,2,3)\) degrees in the training phase; in the testing phase they are directly resized to 256 \(\times \) 256.
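The preprocessing described above could be expressed, for example, with torchvision transforms; the exact pipeline details (interpolation, tensor conversion, lack of normalization) are our assumptions.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def random_rot90(img):
    """Rotate an aerial image by 90n degrees, n = 0, 1, 2, 3."""
    return TF.rotate(img, 90 * random.randint(0, 3))

ground_tf = transforms.Compose([           # street view panoramas (train and test)
    transforms.Resize((112, 616)),
    transforms.ToTensor(),
])

aerial_train_tf = transforms.Compose([     # aerial images, training phase
    transforms.Resize((300, 300)),
    transforms.RandomCrop(256),
    transforms.Lambda(random_rot90),
    transforms.ToTensor(),
])

aerial_test_tf = transforms.Compose([      # aerial images, testing phase
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```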

The networks are implemented in the PyTorch framework. The Adam optimizer is used to train them, with a learning rate of 0.00001 and a batch size of 20. The convolutional layers are initialized with the corresponding base network weights pretrained on ImageNet [4], while the STLs are initialized with the identity transformation. Training runs for a maximum of 20 epochs. The weight \(\lambda \) is empirically set to 0.005, and the margin \(\alpha \) of the triplet loss to 0.2.

Evaluation Metric. We adopt the recall accuracy at top 1% as our evaluation metric, the same as previous works [7, 16, 17]. A query is regarded as correctly localized if the paired aerial image of the given ground query image is within the top 1% of the retrieval results.
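A simple sketch of this metric, assuming the query with index i is paired with the reference aerial image of the same index:

```python
import numpy as np

def recall_at_top_one_percent(query_emb, ref_emb):
    """Fraction of queries whose paired aerial image ranks in the top 1% by distance."""
    n_ref = ref_emb.shape[0]
    k = max(1, int(np.ceil(0.01 * n_ref)))        # size of the top-1% shortlist
    hits = 0
    for i, q in enumerate(query_emb):
        d = np.sum((ref_emb - q) ** 2, axis=1)    # squared Euclidean distances
        shortlist = np.argsort(d)[:k]
        hits += int(i in shortlist)               # ground truth shares the query index
    return hits / len(query_emb)
```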

4.3 Results and Analysis

To evaluate the effectiveness of the proposed network, we compare it with baselines and previous methods. There are two major types of baselines: the Siamese network, which has two identical sub-networks with shared weights, and the Siamese-like network, which has two sub-networks of the same structure but different weights. Both have the same structure as the proposed network except for the spatial transformer layers. AlexNet and VGG16 are used as backbone networks. In addition, the proposed network and the baselines are all trained with the proposed loss, which combines the triplet ranking loss and the location identity loss.

Table 1. Top 1% recall accuracy of the proposed network, baselines, and previous state-of-the-art methods.

The top 1% recall results of the proposed network, the baselines, and the previous state-of-the-art methods (as reported in [7]) are presented in Table 1. As can be seen, the VGG16-based Siamese-like network outperforms the shared-weight Siamese network (VGG16-based) by 11.1%, and the proposed network further increases the accuracy by 1.2%, reaching 95.8%, which is 4.4% higher than the previous state-of-the-art result of 91.4% from CVM-Net-I [7], which also uses VGG16 as the backbone network. The results demonstrate the efficacy of our proposed method in improving cross-view image matching accuracy.

Ablation Study. To further validate the effectiveness of the proposed network and the location identity loss, we conduct an ablation study by comparing the top 1% recall of different networks trained with different losses. There are three network architectures: the Siamese network, the Siamese-like network, and the proposed spatial-aware Siamese-like network. Compared to the proposed network, the baseline Siamese-like networks remove the spatial transformer layers, and the Siamese networks further share weights between the two sub-networks. They are trained either with the triplet ranking loss \(\mathcal {L}_{tri}\) only, or with the proposed loss combining the triplet ranking loss \(\mathcal {L}_{tri}\) and the location identity loss \(\mathcal {L}_{id}\). We also compare results across backbone networks, i.e. AlexNet and VGG16.

Table 2. Top 1% recall accuracy of different network architectures trained with different losses. (\(\lambda =0.005\))

The top 1% recall accuracies of the three networks trained under different losses are shown in Table 2. The deeper the network, the better the results: the VGG16-based networks perform significantly better than their AlexNet-based counterparts. The Siamese-like networks outperform the Siamese networks dramatically, which we attribute to removing the shared-weight constraint and thereby increasing the model capacity to learn view-specific features effectively.

The proposed networks with STLs further improve the results noticeably for both base networks, from 65.0% to 71.1% (a 6.1% increase) for AlexNet and from 92.9% to 94.2% (a 1.3% increase) for VGG16, showing the efficacy of STLs in alleviating large view variation by explicitly learning spatial transformations. It is also interesting that the improvement is more pronounced when the backbone network is shallower (6.1% for AlexNet versus 1.3% for VGG16). Furthermore, the proposed location identity loss improves the performance of all the networks (as shown by the increase \(\varDelta \) in Table 2), which demonstrates its effectiveness in regularizing the networks to learn location-discriminative deep features for cross-view image matching.

Qualitative Results. Some retrieval examples from the proposed network with the best performance (95.8%) are presented in Fig. 4. There are four typical scenes: medium residential area, sparse residential area, open cropland, and forest road. For each case, the ground query image is in the leftmost column, and the top five retrieval results are listed to the right, with the ground truth outlined by an orange box.

Fig. 4. Retrieval examples of four typical scenes (from top to bottom: medium residential area, sparse residential area, open cropland, and forest road). For each ground query image, the top 5 retrieved aerial images are presented, with the ground truth outlined by an orange box. (Color figure online)

We can see that, for all four cases, the top-5 retrieved aerial images present very similar patterns and appearances and are clearly drawn from the same scene categories. It is hard even for human eyes to find the correct matches for the ground images. This further demonstrates the effectiveness of the proposed network and its training loss in learning visually discriminative features for cross-view image matching.

However, it should also be noted that it is almost impossible to identify the correct paired aerial image in the forest road case, since the retrieved results all match the ground query image visually and there seem to be no clues to find the true match. This indicates a limitation of the simple image retrieval-based ground-to-aerial geolocalization approach: it is incapable of distinguishing visually similar scenes. The method can therefore serve as a coarse localization approach. To achieve accurate localization in real-world applications, extra supplementary data sources are needed, or a sequence of query images can be exploited to reduce the ambiguity of similar scenes.

5 Concluding Remarks

Image-based ground-to-aerial geolocalization is a promising approach for localization in GPS-denied situations. However, it is very challenging due to the drastic viewpoint difference between the cross-view images. In this paper, we propose a novel spatial-aware Siamese-like network to address the problem, which exploits the spatial transformer layer to explicitly learn spatial transformations between the ground and overhead images and thereby tackle the large view variation. Moreover, we combine the triplet ranking loss with a simple and effective location identity loss to train the proposed network, further enhancing its performance. We evaluate our method on a publicly available dataset of cross-view image pairs, and the results demonstrate that the proposed method achieves state-of-the-art performance. In the future, we plan to utilize extra supplementary data sources and employ image sequences, in conjunction with cross-view image matching, to meet accurate geolocalization needs in real-world applications.