Abstract
Learned local descriptors based on Convolutional Neural Networks (CNNs) have achieved significant improvements on patch-based benchmarks, but have not demonstrated strong generalization ability on recent benchmarks of image-based 3D reconstruction. In this paper, we mitigate this limitation by proposing a novel local descriptor learning approach that integrates geometry constraints from multi-view reconstructions, which benefits the learning process in terms of data generation, data sampling and loss computation. We refer to the proposed descriptor as GeoDesc, and demonstrate its superior performance on various large-scale benchmarks, in particular on challenging reconstruction tasks. Moreover, we provide guidelines towards practical integration of learned descriptors in Structure-from-Motion (SfM) pipelines, showing the good trade-off between accuracy and efficiency that GeoDesc delivers to 3D reconstruction tasks.
Z. Luo, L. Zhou and Y. Yao were summer interns, T. Shen and R. Zhang were interns at Everest Innovation Technology (Altizure).
S. Zhu has been with Alibaba A.I. Labs since Oct. 2017.
1 Introduction
Computing local descriptors on interest regions serves as a subroutine of various computer vision applications such as panorama stitching [12], wide baseline matching [18], image retrieval [22], and Structure-from-Motion (SfM) [26, 37, 40, 41]. A powerful descriptor is expected to be invariant to both photometric and geometric changes, such as illumination, blur, rotation, scale and perspective changes. Owing to their reliability, efficiency and portability, hand-crafted descriptors such as SIFT [14] have dominated this field for more than a decade. Only recently have great efforts been made to develop learned descriptors based on Convolutional Neural Networks (CNNs), which have achieved surprising results on patch-based benchmarks such as the HPatches dataset [3]. However, on image-based datasets such as the ETH local features benchmark [25], learned descriptors are found to underperform advanced variants of hand-crafted descriptors. These contradictory findings raise concerns about integrating purportedly better descriptors in real applications, and show significant room for improvement in developing more powerful descriptors that generalize to a wider range of scenarios.
One possible cause of the above contradictions, as demonstrated in [25], is the lack of generalization ability as a consequence of data insufficiency. Although previous research [4, 27, 28] discusses several effective sampling methods that produce a seemingly large amount of training data, the generalization ability is still bounded by limited data sources, e.g., the widely-used Brown dataset [6] with only 3 image sets. Hence, it is not surprising that the resulting descriptors tend to overfit to particular scenarios. To overcome this, research such as [29, 38] applies extra regularization for compact feature learning. Meanwhile, LIFT [33] and [19] seek to enhance data diversity and generate training data from reconstructions of Internet tourism data. However, the existing limitation has not yet been fully mitigated, and intermediate geometric information is overlooked in the learning process despite the robust geometric properties that a local patch preserves, e.g., the good approximation of local deformations [20].
Besides, we lack guidelines for integrating learned descriptors in practical pipelines such as SfM. In particular, the ratio criterion, as suggested in [14] and justified in [10], has received almost no individual attention or has been considered inapplicable to learned descriptors [25], even though it delivers excellent matching efficiency and accuracy improvements, and is necessary for pipelines such as SfM to reject false matches and seed feasible initialization. A general method to apply the ratio criterion to learned descriptors is needed in practice.
In this paper, we tackle the above issues by presenting a novel learning framework that takes advantage of geometry constraints from multi-view reconstructed data. In particular, we address the importance of data sampling for descriptor learning and summarize our contributions as threefold. (i) We propose a novel batch construction method that simulates pair-wise matching and effectively samples useful data for the learning process. (ii) Collaboratively, we propose a loss formulation to reduce overfitting and improve the performance with geometry constraints. (iii) We provide guidelines about the ratio criterion, compactness and scalability towards practical portability of learned descriptors.
We evaluate the proposed descriptor, referred to as GeoDesc, on a traditional dataset [9] and two recent large-scale datasets [3, 25]. Superior performance is shown over the state-of-the-art hand-crafted and learned descriptors. We mitigate previous limitations by showing consistent improvements on both patch-based and image-based datasets, and further demonstrate the success of GeoDesc on challenging 3D reconstructions.
2 Related Works
Networks Design. Due to weak semantics and efficiency requirements, existing descriptor learning often relies on shallow and thin networks, e.g., three-layer networks in DDesc [27] with 128-dimensional output features. Moreover, although widely used in high-level computer vision tasks, max pooling is found to be unsuitable for descriptor learning, and is thus replaced by L2 pooling in DDesc [27] or even removed in L2-Net [29]. To further incorporate scale information, DeepCompare [35] and L2-Net [29] use a two-stream central-surround structure which delivers consistent improvements at extra computational cost. To improve the rotational invariance, an orientation estimator is proposed in [34]. Besides feature learning, previous efforts have also been made on joint metric learning as in [7, 8, 35], whereas comparison in Euclidean space is preferred by recent works [4, 5, 27, 29, 33] for the sake of efficiency.
Loss Formulation. Various loss formulations have been explored for effective descriptor learning. Initially, networks with a learned metric use softmax loss [8, 35] and cast descriptor learning as a binary classification problem (similar/dissimilar). With weakly annotated data, [15] formulates the loss on keypoint bags. More generally, pair-wise loss [27, 33] and triplet loss [4, 5, 7] are used by networks without a learned metric. Both loss formulations encourage matching patches to be close and non-matching patches to be far apart in some measure space. In particular, triplet loss delivers better results [4, 7] as it suffers less from overfitting [13]. For effective training, the recent L2-Net [29] and HardNet [17] use the structured loss for data sampling, which drastically improves the performance. To further boost the performance, extra regularizations are introduced for learning compact representations in [29, 38].
Evaluation Protocol. Previous works often evaluate on datasets such as [9, 16, 31]. However, those datasets are either small or lack the diversity to generalize well to various applications of descriptors. As a result, the evaluation results are commonly inconsistent or even contradictory to each other as pointed out in [3], which limits the application of learned descriptors. Two novel benchmarks, HPatches [3] and the ETH local descriptor benchmark [25], have recently been introduced with clearly defined protocols and better generalization properties. However, inconsistency still exists between the two benchmarks: the HPatches benchmark [3] shows that learned descriptors significantly outperform hand-crafted ones, whereas the ETH local descriptor benchmark [25] finds that advanced variants of traditional descriptors are at least on par with learning-based ones. The inconclusive results indicate that there is still significant room for improvement in learning more powerful feature descriptors.
3 Method
3.1 Network Architecture
We borrow the network in L2-Net [29], where the feature tower is constructed by eschewing pooling layers and using strided convolutional layers for in-network downsampling. Each convolutional layer except the last one is followed by a batch normalization (BN) layer whose weighting and bias parameters are fixed to 1 and 0. The L2-normalization layer after the last convolution produces the final 128-dimensional feature vector.
3.2 Training Data Generation
Acquiring high quality training data is important in learning tasks. In this section, we discuss a practical pipeline that automatically produces well-annotated data suitable for descriptor learning.
2D Correspondence Generation. Similar to LIFT [33], we rely on successful 3D reconstructions to generate ground truth 2D correspondences in an automatic manner. First, sparse reconstructions are obtained from a standard SfM pipeline [24]. Then, 2D correspondences are generated by projecting 3D point clouds. In general, SfM is used to filter out most mismatches among images.
Although verified by SfM, the generated correspondences are still contaminated by outliers caused by image noise and wrongly registered cameras. This happens particularly often on Internet tourism datasets such as [23, 30] (illustrated in Fig. 1(a)), and such outliers are unlikely to be filtered by simply limiting the reprojection error. To improve data quality, we take one step further than LIFT by computing a visibility check based on 3D Delaunay triangulation [11], which is widely used for outlier filtering in dense stereo. Empirically, \(30\%\) of 3D points are discarded by the filtering, and only points with high precision are kept for ground truth generation. Figure 1(b) gives an example to illustrate its effect.
Fig. 1. (a) Outlier matches after SfM verification (by COLMAP [24]) on Gendarmenmarkt dataset [30]. The reprojection error (next to the image) cannot be used to identify false matches. (b) Reconstructed sparse point cloud (top), where points in red (bottom) indicate being filtered by Delaunay triangulation and only reliable points in green are kept. The number of points decreases from 75k to 53k after the filtering. (Color figure online)
Matching Patch Generation. Next, the interest region of a 2D projection is cropped similarly to LIFT, which is formulated by a similarity transformation

where \((x^s_i , y_i^s ), (x^t_i , y_i^t)\) are input and output regular sampling grids, and \((x, y, \sigma , \theta )\) are keypoint parameters (x, y coordinates, scale and orientation) from SIFT detector. The constant k is set to 12 as in LIFT, resulting in \(12\sigma \times 12\sigma \) patches.
Due to the robust estimation of scale (\(\sigma \)) and orientation (\(\theta \)) parameters of SIFT even in extreme cases [39], the resulting patches are mostly free of scale and rotation differences, thus suitable for training. In later experiments of image matching or SfM, we rely on the same cropping method to achieve scale and rotation invariance for learned descriptors.
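As a rough illustration of the cropping step, the following Python sketch builds the sampling grid of a similarity transformation from the SIFT keypoint parameters \((x, y, \sigma , \theta )\). The function name `sampling_grid`, the output grid range of [-0.5, 0.5], the 32 x 32 output size and the counter-clockwise rotation convention are assumptions made for this sketch and are not taken from the text.

```python
import numpy as np

def sampling_grid(x, y, sigma, theta, k=12.0, size=32):
    """Return source-image coordinates, shape (size, size, 2), for one keypoint.

    Assumptions (not from the paper text): the output grid spans [-0.5, 0.5],
    so the patch covers k * sigma pixels per side; theta is in radians and
    rotates counter-clockwise; `size` is an illustrative patch resolution.
    """
    # Regular output grid centered on the origin.
    t = np.linspace(-0.5, 0.5, size)
    xt, yt = np.meshgrid(t, t)
    # Similarity transform: scale by k * sigma, rotate by theta, translate to (x, y).
    c, s = np.cos(theta), np.sin(theta)
    xs = k * sigma * (c * xt - s * yt) + x
    ys = k * sigma * (s * xt + c * yt) + y
    return np.stack([xs, ys], axis=-1)

# Axis-aligned keypoint at (100, 50) with scale 2: the patch spans 24 pixels.
grid = sampling_grid(x=100.0, y=50.0, sigma=2.0, theta=0.0)
```

The resulting grid can be handed to any bilinear sampler to extract the patch; the sampler itself is outside the scope of this sketch.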
3.3 Geometric Similarity Estimation
Geometries at a 3D point are robust and provide rich information. Inspired by the MVS (Multi-View Stereo) accuracy measurement in [36], we define two types of geometric similarity: patch similarity and image similarity, which will facilitate later data sampling and loss formulation.
Patch Similarity. We define patch similarity \(S_{patch}\) to measure the difficulty to have a patch pair matched with respect to perspective changes. Formally, given a patch pair, we relate it to its corresponding 3D track P which is seen by cameras centering at \(C_i\) and \(C_j\). Next, we compute the vertex normal \(P_n\) at P from the surface model. The geometric relationship is illustrated in Fig. 2(a). Finally, we formulate \(S_{patch}\) as
where \(s_1\) measures the intersection angle between two viewing rays from the 3D track (\(\angle C_iPC_j\)), while \(s_2\) measures the difference of incidence angles between a viewing ray and the vertex normal from the 3D track (\(\angle C_iPP_n, \angle C_jPP_n\)). The angle metric is defined as \(g(\alpha , \sigma ) = \exp (-\frac{\alpha ^2}{2\sigma ^2})\). As an interpretation, \(s_1\) and \(s_2\) measure the perspective change regarding a 3D point and local 3D surface, respectively. The effect of \(S_{patch}\) is illustrated in Fig. 2(b).
The accuracy of \(s_1\) and \(s_2\) depends on sparse and mesh reconstructions, respectively, and is generally sufficient for its use as shown in [36]. The similarity does not consider scale and rotation changes, as these are already resolved by Eq. 1. Empirically, we choose \(\sigma _1 = 15\) and \(\sigma _2 = 20\) (in degrees).
Image Similarity. Based on the patch similarity, we define the image similarity \(S_{image}\) as the average patch similarity of the correspondences between an image pair. The image similarity measures the difficulty to match an image pair and can be interpreted as a measurement of perspective change. Examples are given in Fig. 2(c). The image similarity will be beneficial for data sampling in Sect. 3.4.
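The two similarity measures can be sketched in Python as follows. The angle metric \(g\) and the values \(\sigma _1 = 15\), \(\sigma _2 = 20\) follow the text; combining \(s_1\) and \(s_2\) by a product, and all function names, are assumptions of this sketch.

```python
import numpy as np

def g(alpha_deg, sigma_deg):
    """Angle metric g(alpha, sigma) = exp(-alpha^2 / (2 sigma^2)), in degrees."""
    return float(np.exp(-alpha_deg ** 2 / (2.0 * sigma_deg ** 2)))

def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def patch_similarity(P, Pn, Ci, Cj, sigma1=15.0, sigma2=20.0):
    """P: 3D track, Pn: vertex normal at P, Ci / Cj: camera centers."""
    ray_i, ray_j = Ci - P, Cj - P
    # s1: intersection angle between the two viewing rays.
    s1 = g(angle_deg(ray_i, ray_j), sigma1)
    # s2: difference of the incidence angles with respect to the normal.
    s2 = g(angle_deg(ray_i, Pn) - angle_deg(ray_j, Pn), sigma2)
    return s1 * s2  # assumed combination of the two terms

def image_similarity(patch_sims):
    """Average patch similarity over the correspondences of an image pair."""
    return float(np.mean(patch_sims))
```

Under these assumptions, two cameras that view a point from the same direction yield a similarity of 1, and the value decays as the viewing directions diverge.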
Fig. 2. (a) The patch similarity relies on the geometric relationship between cameras, tracks and surface normal. (b) The effect of patch similarity, which measures the difficulty to have a patch pair matched with respect to the perspective change. (c) The effect of image similarity, which measures the perspective change between an image pair. (d) Batch data constructed by L2-Net [29] and HardNet [17] (top), whose in-batch patch pairs are often distinctive to each other and thus contribute nothing to the loss in the late learning (e.g., the margin-based loss). In contrast, the batch data from the proposed batch construction method (bottom) consists of similar patch pairs due to the spatially close keypoints or repetitive patterns, which are considered harder to distinguish and thus raise greater challenges for learning
3.4 Batch Construction
For descriptor learning, most existing frameworks take patch pairs (matching/non-matching) or patch triplets (reference, matching and non-matching) as input. As shown in previous studies, the convergence rate is highly dependent on being able to see useful data [21]. Here, “useful” data often refers to patch pairs/triplets that produce meaningful loss for learning. However, the effective sampling of such data is generally challenging due to the intractably large number of patch pair/triplet combinations in the database. Hence, on one hand, sampling strategies such as hard negative mining [27] and anchor swap [4] are proposed, while on the other hand, effective batch construction is used in [7, 17, 29] to compare the reference patch against all the in-batch samples in the loss computation.
Inspired by previous works, we propose a novel batch construction method that effectively samples “useful” data by relying on geometry constraints from SfM, including the image matching results and image similarity \(S_{image}\), to simulate the pair-wise image matching and sample data. Formally, given one image pair, we extract a match set \(X=\{(x_1, x^+_1), (x_2, x^+_2), ..., (x_{N_1}, x_{N_1}^+)\}\), where \(N_1\) is the set size and \((x_i, x^+_i)\) is a matching patch pair surviving the SfM verification. A training batch is then constructed on \(N_2\) match sets. The learning objective thus becomes improving the matching quality for each match set. In Sect. 3.5, we will discuss the loss computation on each match set and batch data.
Compared with L2-Net [29] and HardNet [17], whose training batches are randomly sampled from the whole database, the proposed method produces harder samples and thus raises greater challenges for learning. As shown in the example in Fig. 2(d), the training batch constructed by the proposed method consists of many similar patterns, due to the spatially close keypoints or repetitive textures. In general, such a training batch has two major advantages for descriptor learning:
- It reflects the in-practice complexity. In real applications, image patches are often extracted between image pairs for matching. The proposed method simulates this scenario so that training and testing become more consistent.
- It generates hard samples for training. As observed in [4, 17, 21, 27], hard samples are critical to fast convergence and performance improvement for descriptor learning. The proposed method effectively generates batch data that is sufficiently hard, while being less prone to overfitting because it is constructed on real matching results instead of model inference results [27].
To further boost the training efficiency, we exclude image pairs that are too similar to contribute to the learning. Those pairs are effectively identified by the image similarity defined in Sect. 3.3. In practice, we discard image pairs whose \(S_{image}\) is larger than 0.85 (e.g., the topmost pair in Fig. 2(c)), which results in a \(30\%\) reduction in training samples.
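The batch construction described in this section can be sketched as below. The data layout (dictionaries keyed by a hypothetical image-pair id), the uniform random sampling, and the requirement that a pair has at least \(N_1\) matches are assumptions of this sketch; only the 0.85 threshold and the roles of \(N_1\) and \(N_2\) come from the text.

```python
import random

def build_batch(match_sets, image_sims, n1=64, n2=12, max_sim=0.85, seed=0):
    """Build one training batch of n2 match sets with n1 patch pairs each.

    match_sets: dict pair_id -> list of (patch, matching_patch) tuples that
                survived SfM verification.
    image_sims: dict pair_id -> image similarity S_image of that image pair.
    """
    rng = random.Random(seed)
    # Discard image pairs that are too similar to contribute to learning, and
    # (an assumption of this sketch) pairs with fewer than n1 matches.
    eligible = [pid for pid, matches in match_sets.items()
                if image_sims[pid] <= max_sim and len(matches) >= n1]
    # Each selected image pair contributes one match set to the batch.
    chosen = rng.sample(eligible, n2)
    return [rng.sample(match_sets[pid], n1) for pid in chosen]
```

Because every match set comes from a single image pair, the in-batch negatives are other keypoints of the same two images, which is what makes them hard. Note that `rng.sample` raises `ValueError` when fewer than `n2` image pairs are eligible.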
3.5 Loss Formulation
We formulate the loss with two terms: structured loss and geometric loss.
Structured Loss. The structured loss used in L2-Net [29] and HardNet [17] is essentially suitable to consume the batch samples constructed in Sect. 3.4. In particular, the formulation in HardNet based on the “hardest-in-batch” strategy and a distance margin proves to be more effective than the log-likelihood formulation in L2-Net. However, we observe successive overfitting when applying the HardNet loss to our batch data, which we ascribe to the overly strong constraint of the “hardest-in-batch” strategy. In this strategy, the loss is computed on the data sample that produces the largest loss, and a margin with a large value (1.0 in HardNet) is set to push the non-matching pairs away from matching pairs. In our batch data, we already effectively sample the “hard” data which is often visually similar, thus forcing a large margin is inapplicable and stalls the learning. One simple solution is to decrease the margin value, but the performance then drops significantly in our experiments.
To avoid the above limitation and better take advantage of our batch data, we propose the loss formulation as follows. First, we compute the structured loss for one match set. Given normalized features \(\mathbf {F}_1, \mathbf {F}_2 \in \mathbb {R}^{N_1\times 128}\) computed on match set X for all \((x_i, x_i^+)\), the cosine similarity matrix is derived by \(\mathbf {S} = \mathbf {F}_1 \mathbf {F}_2^T\). Next, we compute \(\mathbf {L} = \mathbf {S} - \alpha \mathbf {diag}(\mathbf {S})\) and formulate the loss as
where \(l_{i, j}\) is the element in \(\mathbf {L}\), and \(\alpha \in (0,1)\) is the distance ratio mimicking the behavior of the ratio test [14] and pushing away non-matching pairs from matching pairs. Finally, we take the average of the loss on each match set to derive the final loss for one training batch.
The proposed formulation is distinct from HardNet in three aspects. First, we compute the cosine similarity instead of Euclidean distance for computational efficiency. Second, we apply a distance ratio margin instead of a fixed distance margin as an adaptive margin to reduce overfitting. Finally, we compute the mean value of each loss element instead of the maximum (“hardest-in-batch”) in order to cooperate with the proposed batch construction.
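A minimal NumPy sketch of a ratio-margin structured loss in this spirit is given below. It uses the cosine similarity matrix, the ratio \(\alpha \) and the mean over loss elements as described in the text. The exact element-wise form is an assumption: here each off-diagonal similarity is compared against \(\alpha \) times the positive similarity of its row and of its column, with a hinge at zero. The function names are hypothetical.

```python
import numpy as np

def structured_loss(F1, F2, alpha=0.4):
    """Ratio-margin loss on one match set (assumed form, needs N1 >= 2).

    F1, F2: (N1, D) L2-normalized features; row i of F1 matches row i of F2.
    """
    S = F1 @ F2.T                      # cosine similarity matrix
    pos = np.diag(S)                   # s_ii: similarity of matching pairs
    off = ~np.eye(len(S), dtype=bool)  # mask selecting non-matching pairs
    # Penalize a non-match whenever it exceeds alpha times the positive
    # similarity of its row (s_ij - alpha * s_ii) or of its column
    # (s_ij - alpha * s_jj).
    row = np.maximum(0.0, S - alpha * pos[:, None])
    col = np.maximum(0.0, S - alpha * pos[None, :])
    # Mean over all elements rather than the "hardest-in-batch" maximum.
    return float((row[off].mean() + col[off].mean()) / 2.0)

def batch_loss(match_sets, alpha=0.4):
    """Average the structured loss over the match sets of one batch."""
    return float(np.mean([structured_loss(F1, F2, alpha) for F1, F2 in match_sets]))
```

With perfectly separated features (orthogonal non-matches) the loss is zero, while a fully collapsed embedding, in which all similarities equal 1, gives \(1 - \alpha \).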
Geometric Loss. Although \(E_1\) ensures matching patch pairs to be distant from the non-matching, it does not explicitly encourage matching pairs to be close in its measure space. One simple solution is to apply a typical pair-wise loss as in [27], at the risk of positive collapse and overfitting as observed in [13]. To overcome this, we adaptively set up the margin regarding the patch similarity defined in Sect. 3.3, serving as a soft constraint for maximizing the positive similarity. We refer to this term as geometric loss and formulate it as
where \(\beta \) is the adaptive margin, \(s_{i, i}\) is the element in S, namely, the cosine similarity of patch pair \((x_i, x_i^+)\), while \(s_{patch}\) is the patch similarity for \((x_i, x_i^+)\). We use \(E_1 + \lambda E_2\) as the final loss, and empirically set \(\alpha \) and \(\lambda \) to 0.4 and 0.2.
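The geometric term can be sketched in the same style. The hinge on \(\beta - s_{i, i}\) and the weighting \(E_1 + \lambda E_2\) with \(\lambda = 0.2\) follow the description above; the mapping from patch similarity to the margin \(\beta \) is not recoverable from the text, so `adaptive_margin` below is a purely hypothetical linear mapping with illustrative bounds.

```python
import numpy as np

def adaptive_margin(s_patch, low=0.2, high=0.7):
    """Hypothetical mapping from patch similarity to the margin beta.

    Easier pairs (larger patch similarity) are asked to reach a larger
    cosine similarity. The bounds are illustrative values only.
    """
    return low + (high - low) * np.asarray(s_patch, dtype=float)

def geometric_loss(F1, F2, s_patch):
    """Soft constraint pushing the positive similarity s_ii above beta."""
    s_pos = np.sum(F1 * F2, axis=1)   # s_ii for each matching pair
    beta = adaptive_margin(s_patch)
    return float(np.mean(np.maximum(0.0, beta - s_pos)))

def total_loss(e1, e2, lam=0.2):
    """Final loss E1 + lambda * E2."""
    return e1 + lam * e2
```

The hinge means that a matching pair stops contributing once its similarity exceeds its margin, which is what makes the constraint soft rather than forcing all positives to collapse to similarity 1.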
3.6 Training
We use image sets [30] as in LIFT [33], the SfM data in [23], and further collect several image sets to form the training database. Based on COLMAP [24], we run 3D reconstructions to establish necessary geometry constraints. Image sets that overlap with the benchmark data are manually excluded. We train the networks from scratch using Adam with a base learning rate of 0.001 and weight decay of 0.0001. The learning rate decays by 0.9 every 10,000 steps. Data augmentation includes random flipping, 90-degree rotation, and brightness and contrast adjustment. The match set size \(N_1\) and batch size \(N_2\) are 64 and 12, respectively. Input patches are standardized to have zero mean and unit norm.
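Two of the training details above are easy to make concrete. The sketch below assumes a staircase schedule (the rate is multiplied by 0.9 once per 10,000 steps) and standardization per patch; both readings, and the function names, are assumptions of this sketch.

```python
import numpy as np

def learning_rate(step, base_lr=0.001, decay=0.9, decay_steps=10_000):
    """Staircase schedule (assumed): multiply by `decay` every `decay_steps`."""
    return base_lr * decay ** (step // decay_steps)

def standardize(patch, eps=1e-8):
    """Shift a patch to zero mean and scale it to unit L2 norm."""
    centered = patch.astype(np.float64) - patch.mean()
    # eps guards against division by zero for constant patches.
    return centered / (np.linalg.norm(centered) + eps)
```

For instance, `learning_rate(25_000)` applies the decay twice, giving 0.001 x 0.9^2.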
4 Experiments
We evaluate the proposed descriptor on three datasets: the patch-based HPatches [3] benchmark, the image-based Heinly benchmark [9] and ETH local features benchmark [25]. We further demonstrate on challenging SfM examples.
4.1 HPatches Benchmark
The HPatches benchmark [3] defines three complementary tasks: patch verification, patch matching, and patch retrieval. Different levels of geometrical perturbations are imposed to form EASY, HARD and TOUGH patch groups. In the task of verification, two subtasks are defined based on whether negative pairs are sampled from images within the same (SAMESEQ) or different sequences (DIFFSEQ). In the task of matching, two subtasks are defined based on whether the principal variation is viewpoint (VIEW) or illumination (ILLUM). Following [3], we use mean average precision (mAP) to measure the performance for all three tasks on HPatches split ‘full’.
We select five descriptors to compare: SIFT as the baseline, RSIFT [2] and DDesc [27] as the best-performing hand-crafted and learned descriptors concluded in [3]. Moreover, we experiment with recent learned descriptors L2-Net [29] and HardNet [17]. The proposed descriptor is referred to as GeoDesc.
As shown in Fig. 3, GeoDesc surpasses all the other descriptors on all three tasks by a large margin. In particular, the performance on the TOUGH patch group is significantly improved, which indicates the superior invariance of GeoDesc to large image changes. Interestingly, comparing GeoDesc with HardNet, we observe some performance drop on EASY groups, especially for illumination changes, which can be ascribed to the data bias of SfM data. Although we apply random augmentation such as illumination changes during training, we cannot fully mitigate this limitation, which calls for more diverse real data in descriptor learning.
In addition, we evaluate different configurations of GeoDesc on HPatches as shown in Table 1 to demonstrate the effect of each part of our method.
- Config. 1: the HardNet framework as the baseline.
- Config. 2: trained with the SfM data in Sect. 3.2. Compared with Config. 1, it is shown that crowd-sourced training data essentially improves the generalization ability. On the other hand, Config. 2 can be regarded as an extension of LIFT [33] with a more advanced loss formulation.
- Config. 3: equipped with the proposed batch construction in Sect. 3.4. As discussed in Sect. 3.5, the “hardest-in-batch” strategy in HardNet is inapplicable to hard batch data and thus leads to a performance drop compared with Config. 2. In practice, we need to adjust the margin value from 1.0 in HardNet to 0.6, otherwise the training will not even converge. Though trainable, the smaller margin value harms the final performance.
- Config. 4: equipped with the modified structured loss in Sect. 3.5. Notable performance improvements are achieved over Config. 2 due to the collaborative use of the proposed methods, showing the effectiveness of simulating pair-wise matching and sampling hard data. Besides, replacing the distance margin with the distance ratio can improve the training efficiency, as shown in Fig. 4.
- Config. 5: equipped with the geometric loss in Sect. 3.5. Further improvements are obtained over Config. 4 as \(E_2\) constrains the solution space and enhances the training efficiency.
To sum up, the “hardest-in-batch” strategy is beneficial when no other sampling is applied and most in-batch samples do not contribute to the loss. However, with harder batch data effectively constructed, it is advantageous to replace the “hardest-in-batch” and further boost the descriptor performance.
4.2 Heinly Benchmark
Different from HPatches, which experiments on image patches, the benchmark by Heinly et al. [9] evaluates pair-wise image matching regarding different types of photometric or geometric changes, aiming to provide practical insights into the strengths and weaknesses of descriptors. We use two standard metrics as in [9] to quantify the matching quality. First, the Matching Score = #Inlier Matches/#Features. Second, the Recall = #Inlier Matches/#True Matches. Four descriptors are selected to compare: SIFT, the baseline hand-crafted descriptor; DSP-SIFT, the best hand-crafted descriptor, even superior to the previous learning-based descriptors as evaluated in [25]; L2-Net and HardNet, the recent advanced learned descriptors. For a fair comparison, no ratio test and only cross check (mutual test) is applied for all descriptors.
Evaluation results are shown in Table 2. Compared with DSP-SIFT, GeoDesc performs comparably regarding image quality changes (compression, blur), while notably better for illumination and geometric changes (rotation, scale, viewpoint). On the other hand, GeoDesc delivers significant improvements over L2-Net and HardNet and particularly narrows the gap in terms of photometric changes, which makes GeoDesc applicable to different scenarios in real applications.
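The matching protocol and the two metrics can be illustrated with a small NumPy sketch. The brute-force Euclidean distance matrix and the function names are choices made for this sketch; the benchmark's own implementation may differ.

```python
import numpy as np

def mutual_matches(D1, D2):
    """Cross check: keep (i, j) when j is the nearest neighbor of i in D2
    and i is the nearest neighbor of j in D1. D1: (N, d), D2: (M, d)."""
    # Brute-force pairwise Euclidean distances, fine for small inputs.
    dist = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=2)
    nn12 = dist.argmin(axis=1)   # best match in D2 for each row of D1
    nn21 = dist.argmin(axis=0)   # best match in D1 for each row of D2
    return [(i, int(j)) for i, j in enumerate(nn12) if nn21[j] == i]

def matching_score(n_inliers, n_features):
    """Matching Score = #Inlier Matches / #Features."""
    return n_inliers / n_features

def recall(n_inliers, n_true_matches):
    """Recall = #Inlier Matches / #True Matches."""
    return n_inliers / n_true_matches
```

Counting inliers requires ground-truth geometry (for example a known homography), which is provided by the benchmark and is not part of this sketch.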
4.3 ETH Local Features Benchmark
The ETH local features benchmark [25] focuses on image-based 3D reconstruction tasks. We compare GeoDesc with SIFT, DSP-SIFT and L2-Net, and follow the same protocols in [25] to quantify the SfM quality, including the number of registered images (# Registered), reconstructed sparse points (# Sparse Points), image observations (# Observations), mean track length (Track Length) and mean reprojection error (Reproj. Error). For a fair comparison, we apply no distance ratio test for descriptor matching and extract features at the same keypoints as in [25].
As observed in Table 3, first, GeoDesc performs best on # Registered, which is generally considered the most important SfM metric as it directly quantifies the reconstruction completeness. Second, GeoDesc achieves the best results on # Sparse Points and # Observations, which indicates superior matching quality in the early step of SfM. However, GeoDesc does not obtain the best Track Length and Reproj. Error, as these two metrics are computed on a significantly larger # Sparse Points. For small-scale datasets with similar track numbers (Fountain, Herzjesu), GeoDesc gives the longest Track Length.
To sum up, GeoDesc surpasses both the previous best-performing DSP-SIFT and recent advanced L2-Net by a notable margin. In addition, it is noted that L2-Net also shows consistent improvements over DSP-SIFT, which demonstrates the power of taking structured loss for learned descriptors.
4.4 Challenging 3D Reconstructions
To further demonstrate the effect of the proposed descriptor in the context of 3D reconstruction, we showcase selected image sets whose reconstructions fail or are of low quality with a typical SIFT-based 3D reconstruction pipeline but are significantly improved by integrating GeoDesc.
The examples in Fig. 5 clearly show the benefit of deploying GeoDesc in a reconstruction pipeline. First, thanks to robust matching that is resistant to photometric and geometric changes, a complete sparse reconstruction with more registered cameras can be obtained. Second, due to more accurate camera pose estimation, a finer final mesh reconstruction is derived.
5 Practical Guidelines
In this section, we discuss several practical guidelines to complement the performance evaluation and provide insights towards real applications. The following experiments are conducted with 231 extra high-resolution image pairs, whose keypoints are downsampled to \(\sim \)10k per image. We use a single NVIDIA GTX 1080 GPU with TensorFlow [1], and forward each batch with 256 patches.
5.1 Ratio Criterion
The ratio criterion [14] compares the distance between the first and the second nearest neighbor, and establishes a match if the former is smaller than the latter by some ratio. For SfM tasks, the ratio criterion improves overall matching quality and RANSAC efficiency, and seeds robust initialization. Despite those benefits, the ratio criterion has received little attention, or has even been considered inapplicable to learned descriptors in previous studies [25]. Here, we propose a general method to determine a ratio that cooperates well with existing SfM pipelines.
The general idea is simple: the new ratio should function similarly to SIFT’s, as most SfM pipelines are parameterized for SIFT. To quantify the effect of the ratio criterion, we use the metric Precision = #Inlier Matches/#Putative matches, and determine the ratio that achieves a Precision similar to SIFT’s. As an example in Fig. 6, we compute the Precision of SIFT and GeoDesc on our experimental dataset, and find that the best ratio for GeoDesc is 0.89, at which it gives a Precision (0.70) similar to that of SIFT (0.69). This ratio is applied to the experiments in Sect. 4.4 and shows robust results and compatibility in the practical SfM pipeline.
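The calibration procedure can be sketched as a simple search over candidate ratios. The candidate range, the tie-breaking (the smallest candidate wins) and the function names are assumptions of this sketch; `inliers` stands for a set of ground-truth-verified matches that must be supplied by the evaluation data.

```python
import numpy as np

def ratio_test(dist, ratio):
    """Keep (i, j) when the nearest distance is below ratio * second nearest.

    dist: (N, M) descriptor distance matrix with M >= 2.
    """
    order = np.argsort(dist, axis=1)
    rows = np.arange(len(dist))
    d1 = dist[rows, order[:, 0]]   # distance to the nearest neighbor
    d2 = dist[rows, order[:, 1]]   # distance to the second nearest neighbor
    keep = d1 < ratio * d2
    return [(int(i), int(order[i, 0])) for i in np.flatnonzero(keep)]

def precision(matches, inliers):
    """Precision = #Inlier Matches / #Putative Matches."""
    return sum(m in inliers for m in matches) / len(matches) if matches else 0.0

def calibrate_ratio(dist, inliers, target_precision,
                    candidates=np.arange(0.50, 1.00, 0.01)):
    """Pick the candidate ratio whose precision is closest to the target,
    e.g. the precision that SIFT reaches with its usual ratio."""
    def gap(r):
        return abs(precision(ratio_test(dist, r), inliers) - target_precision)
    return float(min(candidates, key=gap))
```

In practice the precision would be accumulated over many image pairs rather than a single distance matrix before the ratio is chosen.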
5.2 Compactness Study
A compact feature representation generally indicates better performance with respect to discriminativeness and scalability. To quantify the compactness, we rely on the intermediate result in Principal Component Analysis (PCA). First, we compute the explained variance \(v_i\), which is stored in decreasing order for each feature dimension indexed by i. Then we estimate the compact dimensionality (denoted as Compact-Dim) by finding the minimal k that satisfies \(\sum ^k_i{v_i}/\sum ^D_i{v_i} \ge t\), where t is a given threshold and D is the original feature dimensionality. In this experiment, we set \(t = 0.9\), so that the Compact-Dim can be interpreted as the minimal dimensionality required to convey at least \(90\%\) of the information of the original feature. Obviously, larger Compact-Dim indicates less redundancy, namely greater compactness.
As a result, the Compact-Dim estimated on 4 million feature vectors for SIFT, DSP-SIFT, L2-Net and GeoDesc is 56, 63, 75 and 100, respectively. The ranking of Compact-Dim corresponds well to the previous performance evaluations, where descriptors with larger Compact-Dim yield better results.
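The Compact-Dim estimate can be sketched with NumPy as follows. The sketch sorts the explained variances from largest to smallest so that the minimal k is well defined, and obtains them from the eigenvalues of the sample covariance matrix; the function name is hypothetical.

```python
import numpy as np

def compact_dim(features, t=0.9):
    """Smallest k whose top-k principal components explain at least a
    fraction t of the total variance. features: (N, D) array with D >= 2."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them so the
    # explained variance is sorted from largest to smallest.
    variances = np.linalg.eigvalsh(cov)[::-1]
    cumulative = np.cumsum(variances) / variances.sum()
    # First index whose cumulative ratio reaches t, converted to a count.
    k = int(np.searchsorted(cumulative, t)) + 1
    return min(k, len(variances))
```

A descriptor whose variance is concentrated in a few directions gets a small Compact-Dim, whereas one that spreads its variance evenly over all dimensions approaches D.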
5.3 Scalability Study
Computational Cost. As evaluated in [3, 25], the efficiency of learned descriptors is on par with traditional descriptors such as CPU-based SIFT. Here, we further compare with GPU-based SIFT [32] to provide insights about practicability. We evaluate the running time in three steps: first, keypoint detection and canonical orientation estimation by SIFT-GPU; next, patch cropping by Eq. 1; finally, feature inference of image patches. The computational cost and memory demand are shown in Table 4, indicating that with GPU support, not surprisingly, SIFT (0.20s) is still faster than the learned descriptor (0.31s), with a narrow gap due to the parallel implementation. For applications heavily relying on matching quality (e.g., 3D reconstruction), the proposed descriptor achieves a good trade-off to replace SIFT.
Quantization. To conserve disk space, I/O and memory, we linearly map feature vectors of GeoDesc from \([-1, 1]\) to [0, 255] and round each element to unsigned-char value. The quantization does not affect the performance as evaluated on HPatches benchmark.
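The quantization step amounts to a linear map followed by rounding. In the sketch below, the clipping to \([-1, 1]\), the use of NumPy's round-half-to-even `rint`, and the inverse map `dequantize` are choices made for illustration.

```python
import numpy as np

def quantize(desc):
    """Linearly map values in [-1, 1] to [0, 255] and round to uint8."""
    scaled = (np.clip(desc, -1.0, 1.0) + 1.0) * 127.5
    return np.rint(scaled).astype(np.uint8)

def dequantize(q):
    """Approximate inverse map from uint8 back to [-1, 1]."""
    return q.astype(np.float32) / 127.5 - 1.0
```

Each element is stored in one byte instead of four, and the round-trip error per element is bounded by half a quantization step, i.e. 1/255.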
6 Conclusions
In contrast to prior work, we have addressed the advantages of integrating geometry constraints for descriptor learning, which benefits the learning process in terms of ground truth data generation, data sampling and loss computation. Also, we have discussed several guidelines, in particular, the ratio criterion, towards practical portability. Finally, we have demonstrated the superior performance and generalization ability of the proposed descriptor, GeoDesc, on three benchmark datasets in different scenarios. We have further shown the significant improvement of GeoDesc on challenging reconstructions, and the good trade-off between efficiency and accuracy when deploying GeoDesc in real applications.
References
Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv (2016)
Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR (2012)
Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)
Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC (2016)
Balntas, V., Johns, E., Tang, L., Mikolajczyk, K.: PN-net: conjoined triple deep network for learning local image descriptors. arXiv (2016)
Brown, M.A., Hua, G., Winder, S.A.J.: Discriminative learning of local image descriptors. PAMI 33, 43–57 (2011)
Vijay Kumar, B.G., Carneiro, G., Reid, I.: Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions. In: CVPR (2016)
Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: MatchNet - unifying feature and metric learning for patch-based matching. In: CVPR (2015)
Heinly, J., Dunn, E., Frahm, J.-M.: Comparative evaluation of binary features. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 759–773. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_54
Kaplan, A., Avraham, T., Lindenbaum, M.: Interpreting the ratio criterion for matching SIFT descriptors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 697–712. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_42
Labatut, P., Pons, J.P., Keriven, R.: Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In: ICCV (2007)
Li, S., Yuan, L., Sun, J., Quan, L.: Dual-feature warping-based motion model estimation. In: ICCV (2015)
Lin, J., Morere, O., Chandrasekhar, V., Veillard, A., Goh, H.: DeepHash: getting regularization, depth and fine-tuning right. arXiv (2015)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. In: IJCV (2004)
Markuš, N., Pandžić, I.S., Ahlberg, J.: Learning local descriptors by optimizing the keypoint-correspondence criterion. In: ICPR (2016)
Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27, 1615–1630 (2005)
Mishchuk, A., Mishkin, D., Radenovic, F.: Working hard to know your neighbor’s margins: local descriptor learning loss. In: NIPS (2017)
Mishkin, D., Matas, J., Perdoch, M., Lenc, K.: WxBS: wide baseline stereo generalizations. In: BMVC (2015)
Mitra, R., et al.: A large dataset for improving patch matching. arXiv (2018)
Morel, J.M., Yu, G.: ASIFT: a new framework for fully affine invariant image comparison. SIAM J. Imaging Sci. 2, 438–469 (2009)
Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S.: No fuss distance metric learning using proxies. In: ICCV (2017)
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
Radenović, F., Tolias, G., Chum, O.: CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 3–20. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_1
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
Schönberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of hand-crafted and learned local features. In: CVPR (2017)
Shen, T., Zhu, S., Fang, T., Zhang, R., Quan, L.: Graph-based consistent matching for structure-from-motion. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 139–155. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_9
Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: CVPR (2015)
Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: NIPS (2016)
Tian, Y., Fan, B., Wu, F.: L2-Net: deep learning of discriminative patch descriptor in Euclidean space. In: CVPR (2017)
Wilson, K., Snavely, N.: Robust global translations with 1DSfM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 61–75. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_5
Winder, S., Hua, G., Brown, M.: Picking the best daisy. In: CVPR (2009)
Wu, C.: SiftGPU: a GPU implementation of SIFT (2007). http://cs.unc.edu/~ccwu/siftgpu
Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: learned invariant feature transform. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 467–483. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_28
Yi, K.M., Verdie, Y., Fua, P., Lepetit, V.: Learning to assign orientations to feature points. In: CVPR (2015)
Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: CVPR (2015)
Zhang, R., Li, S., Fang, T., Zhu, S., Quan, L.: Joint camera clustering and surface segmentation for large-scale multi-view stereo. In: ICCV (2015)
Zhang, R., Zhu, S., Fang, T., Quan, L.: Distributed very large scale bundle adjustment by global camera consensus. In: ICCV (2017)
Zhang, X., Yu, F.X., Kumar, S., Chang, S.F.: Learning spread-out local feature descriptors. In: CVPR (2017)
Zhou, L., Zhu, S., Shen, T., Wang, J., Fang, T., Quan, L.: Progressive large scale-invariant image matching in scale space. In: ICCV (2017)
Zhu, S., Fang, T., Xiao, J., Quan, L.: Local readjustment for high-resolution 3D reconstruction. In: CVPR (2014)
Zhu, S., et al.: Very large-scale global SFM by distributed motion averaging. In: CVPR (2018)
Acknowledgment
This work is supported by T22-603/15N, Hong Kong ITC PSKL12EG02 and the Special Project of International Scientific and Technological Cooperation in Guangzhou Development District (No. 2017GH24).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Luo, Z. et al. (2018). GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol 11213. Springer, Cham. https://doi.org/10.1007/978-3-030-01240-3_11
DOI: https://doi.org/10.1007/978-3-030-01240-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01239-7
Online ISBN: 978-3-030-01240-3
eBook Packages: Computer Science, Computer Science (R0)