
1 Introduction

Vision applications such as 3D object reconstruction [1], image stitching [2], and object tracking in video sequences [3, 4] rely heavily on correct correspondence matching across images.

Determining whether a pair of salient points correctly correspond to each other is a challenging task, mainly because of the scale, rotation, and viewpoint transformations between the compared images. Over the past decades, salient point methods have proven effective for this problem. These methods first locate extrema (the candidate salient points) in the image scale space, and then generate a local descriptor to characterize each salient point. Finally, the nearest neighbor under a similarity measure is taken as the correspondence. A representative salient point method is SIFT [5], which detects salient points in a Difference-of-Gaussians scale space and represents them with orientation histograms of gradients. Most other efforts (such as SURF [6] and KAZE [7]) aim to improve the efficiency or accuracy of salient point localization. SURF approximates the commonly used Laplacian of Gaussian (LoG) with a box filter, and employs the integral image to speed up the box-filter-based scale space construction. The more recent KAZE builds a nonlinear scale space using Additive Operator Splitting (AOS) and variable conductance diffusion to reduce noise; the nonlinear scale space retains object boundary structure and yields more accurate salient point positions. Furthermore, local binary descriptors (BRIEF [8], ORB [9], BRISK [10], and FREAK [11]) were proposed for their fast computation and low memory requirements. These binary descriptors are mainly generated by pairwise intensity comparisons in a pre-defined sampling structure. However, they focus primarily on improving speed and storage rather than precision.

The goal of this paper is to improve the accuracy of correspondence matching for salient point methods. We propose a novel multiple feature fusion (MFF) framework that is robust to challenging transformations such as rotation and perspective changes. Our framework is motivated by the theory of global precedence: humans perceive the global structure of a scene before its fine local details. As illustrated in Fig. 1, the proposed framework combines the low-level local feature of a salient point with the high-level feature of its surrounding patch in a pre-defined global structure to establish correct correspondences.

Fig. 1.

Two salient points share the same nearest neighbor point between the compared images. However, the green line is a false match, while the yellow line is a correct match, because the yellow patch is more similar to the blue patch than the green patch is. (Color figure online)

Two questions are central to the proposed framework: how to define the global structure in the image, and what kind of features are appropriate to represent the patches in this pre-defined global structure. Specifically, we employ a retina-inspired sampling pattern to construct a retina patch-structure in the image; the retina sampling pattern mimics the topology of the retina in the human visual system. Moreover, since image representations built upon convolutional neural networks (CNNs) [12] are highly discriminative, we describe these patches with high-level CNN features. The performance evaluation on two popular benchmark datasets demonstrates that the proposed MFF framework significantly increases the accuracy, stability, and reliability of correspondence matching under various image transformations, especially rotation and perspective changes.

The rest of the paper is organized as follows: Sect. 2 gives a brief review of related work. The construction of the proposed MFF framework is presented in Sect. 3. In Sect. 4, we describe the datasets and evaluation criterion used in the experiments. The performance results of the MFF are shown in Sect. 5, and conclusions are given in Sect. 6.

2 Related Work

Owing to the high performance of deep convolutional neural networks in various computer vision applications, CNN-based image correspondence matching has received increasing attention. Fischer et al. [13] extracted salient regions in an image via the MSER detector; the extracted regions were normalized to a fixed resolution and passed through a pre-trained convolutional neural network, and the output of the last layer was used to represent each patch. Long et al. [14] and Tulsiani et al. [15] proposed to predict salient points from the convnet features output by a CNN architecture. Recent methods mainly focus on supervised learning schemes. Zagoruyko et al. [16] and Han et al. [17] used a Siamese network architecture that minimizes a pair-wise similarity loss over annotated pairs of raw image patches, jointly learning the features of local patches and the similarity metric between them. The triplet network [18] employs a triplet ranking loss, which preserves the relative similarity relations of the learned features representing local patches. The framework introduced in this paper instead fuses the low-level local feature of each salient point with the high-level CNN feature of the patch it belongs to, in order to achieve accurate correspondence matching.

3 Multiple Feature Fusion Framework

3.1 Retina Sampling Pattern Review

The retina sampling pattern has been widely used in various computer vision applications [11, 19]; these approaches exploit the topology of the human retina revealed by neuro-biology research, namely that the spatial density of cone cells decreases exponentially with distance from the center of the retina. Following the cone density illustrated in Fig. 2(a), our approach employs a similar retina topology to define the patch structure in the image. As shown in Fig. 2(b), blocks of different sizes are placed over the image domain, with high sampling density in the central area. The advantages of the proposed retina patch-structure are as follows: a small number of patches (43) covers almost the entire image domain, which offers a good trade-off between accuracy and efficiency for CNN feature extraction; and the block sizes follow a log-polar layout with denser, smaller patches in the central image region, so that more detail is captured in the center. Additionally, the overlap between neighboring patches in the retina structure is intended to increase matching performance. A sketch of such a layout is given below.
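As a concrete illustration, the following is a minimal sketch of a retina-like patch layout: one central patch plus concentric rings whose radii and patch sizes grow with eccentricity. The ring count, growth factors, and size parameters here are illustrative assumptions rather than the exact values used in the paper, although 1 + 7 rings × 6 patches reproduces the 43 patches mentioned above.

```python
import numpy as np

def retina_patches(width, height, rings=7, per_ring=6,
                   r0=0.08, growth=1.35, s0=0.10, s_growth=1.3):
    """Generate a retina-like patch layout as (cx, cy, size) tuples:
    one central patch plus concentric rings whose radii and patch
    sizes grow outward. All parameters are illustrative assumptions."""
    cx, cy = width / 2.0, height / 2.0
    base = min(width, height)
    patches = [(cx, cy, s0 * base)]              # central, smallest patch
    for k in range(rings):
        radius = r0 * base * (growth ** k)       # ring radius grows geometrically
        size = s0 * base * (s_growth ** k)       # patch size grows with eccentricity
        for n in range(per_ring):
            # stagger alternate rings so neighboring patches overlap
            theta = 2 * np.pi * n / per_ring + (k % 2) * np.pi / per_ring
            patches.append((cx + radius * np.cos(theta),
                            cy + radius * np.sin(theta), size))
    return patches  # 1 + rings * per_ring = 43 patches for the defaults
```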

Fig. 2.

(a) The illustration of the density distribution of cones in the human retina. (b) The retina patch-structure in the image domain.

3.2 High Level Feature from CNNs

The motivation to use the output of a pre-trained CNN to represent the retina patch-structure stems from two properties of CNN features. First, CNN features are highly discriminative and outperform manually designed features in various computer vision applications by a large margin. Second, CNN features are transferable: several studies [20, 21] have demonstrated that pre-trained networks still work well when applied to vision tasks that differ from the datasets they were trained on.

The performance evaluation is based on the popular network architecture presented by Krizhevsky et al. [12] (AlexNet), which was trained on 1.2 million images from ILSVRC2012 for classification (other high-performance networks such as VGGNet and GoogLeNet can also be used in the proposed framework). The AlexNet architecture consists of five stacked convolutional layers, interleaved with normalization and pooling layers, followed by two fully connected layers and a softmax classifier on top. Each fully connected layer contains 4096 neurons, and we use the Caffe implementation [22] to extract the activations of the last two fully connected layers to represent each patch (referred to as fc6 and fc7, each of dimension 4096).
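As an illustration of this extraction step, here is a minimal sketch using torchvision's pre-trained AlexNet as a stand-in for the paper's Caffe pipeline; in torchvision's layout, classifier[1] and classifier[4] are the two 4096-d fully connected layers corresponding to fc6 and fc7.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained AlexNet (a stand-in for the paper's Caffe model).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# Capture the fc6/fc7 activations with forward hooks.
acts = {}
net.classifier[1].register_forward_hook(lambda m, i, o: acts.__setitem__("fc6", o))
net.classifier[4].register_forward_hook(lambda m, i, o: acts.__setitem__("fc7", o))

preprocess = T.Compose([
    T.Resize((224, 224)),   # normalize each patch to the network input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def describe_patch(patch_img):
    """Return (fc6, fc7) activations for one retina patch (a PIL image)."""
    with torch.no_grad():
        net(preprocess(patch_img).unsqueeze(0))
    return acts["fc6"].squeeze(0), acts["fc7"].squeeze(0)
```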

3.3 Multiple Feature Fusion Framework

To fuse the local features of salient points with the CNN features of the patches in the retina sampling pattern, we propose a novel feature fusion scheme. For a specific salient point \(P(x,y)\) in image I, we first calculate its local descriptor f, which is invariant to scale, rotation, and noise (e.g., SIFT or SURF). We then calculate the distance between the salient point position and the center of each retina patch to determine which patch the salient point belongs to. Finally, each salient point is assigned a feature set \(F_{P(x,y)}=\{f,fc6_i,fc7_i\}\), where \(i\in \{1,\dots ,N\}\) indicates that \(P(x,y)\) belongs to the ith patch in the retina patch-structure and N is the total number of retina patches.
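A minimal sketch of the patch assignment step, assuming the (cx, cy, size) patch tuples produced by the layout sketch in Sect. 3.1:

```python
import numpy as np

def assign_patch(point, patches):
    """Assign a salient point (x, y) to the retina patch with the
    nearest center; `patches` holds (cx, cy, size) tuples."""
    centers = np.array([(cx, cy) for cx, cy, _ in patches])
    return int(np.argmin(np.linalg.norm(centers - np.asarray(point), axis=1)))
```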

Because the values of the raw CNN features vary widely, normalization is necessary. Inspired by the normalization of rootSIFT [23], which is more distinctive than SIFT, we apply the same normalization to the original CNN features: the feature vector is L1-normalized and then the element-wise square root is taken.
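A minimal sketch of this normalization; the signed square root is an assumption added to guard against negative activations (e.g., features taken before a ReLU), and reduces to the plain square root for non-negative vectors:

```python
import numpy as np

def root_normalize(v, eps=1e-12):
    """RootSIFT-style normalization: L1-normalize, then take the
    element-wise (signed) square root."""
    v = np.asarray(v, dtype=np.float64)
    v = v / (np.abs(v).sum() + eps)
    return np.sign(v) * np.sqrt(np.abs(v))
```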

We define the similarity measure \(S(P, P')\) to determine whether two salient points \(P(x,y)\) and \(P'(x', y')\) form a correspondence, as follows:

$$\begin{aligned} \small S(P, P')=\exp (s(f,f'))\times (s(fc6_i,fc6_j')+s(fc7_i,fc7_j')) \end{aligned}$$
(1)

where \(s(\cdot )\) denotes the Euclidean distance, and the exponential function is used to emphasize the distance between the two local descriptors.

Moreover, since the patches in the proposed retina patch-structure overlap, we adopt a multiple assignment (MA) strategy: each salient point is assigned K CNN features from its K nearest patch centers, and the similarity measure \(S(P, P')\) is updated as:

$$\begin{aligned} \small S(P, P')=\exp (s(f,f'))\times \sum _{i, j=1}^{K}(s(fc6_i,fc6_j')+s(fc7_i,fc7_j')) \end{aligned}$$
(2)
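A minimal sketch of Eq. (2), assuming the sum pairs the kth nearest patch on each side (the double index in Eq. (2) could also be read as summing over all K × K patch pairs):

```python
import numpy as np

def s(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def similarity(f, f2, fc6s, fc6s2, fc7s, fc7s2):
    """Eq. (2): fuse the local-descriptor distance with the CNN patch
    distances from the K nearest patches (multiple assignment)."""
    cnn_term = sum(s(a, b) + s(c, d)
                   for a, b, c, d in zip(fc6s, fc6s2, fc7s, fc7s2))
    return np.exp(s(f, f2)) * cnn_term
```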

The correspondence matching results in Fig. 3 demonstrate the strength of our MFF under challenging perspective transformations in comparison to the popular SIFT and rootSIFT.

Fig. 3.

Illustration of correspondence matching. The MFF is applied on top of rootSIFT and compared to SIFT and rootSIFT on challenging affine object detection (graffiti 1 vs 5 and 1 vs 6, proposed by Mikolajczyk and Schmid [24]). Our framework accurately locates the object after homography estimation with RANSAC.

4 Experiment Setup

In this section, we describe the experiments that show the effectiveness of the proposed MFF framework. Correspondence matching accuracy is evaluated for MFF-rootSIFT and MFF-SURF, which apply our framework, against the leading popular approaches: SIFT, SURF, and rootSIFT. The experimental environment is: Intel quad-core i7 processor (2.6 GHz), 12 GB of RAM, and an NVIDIA GTX 970 with 4 GB of VRAM. The parameters of each compared salient point method were set to their defaults, and our MFF implementation is available online at: http://press.liacs.nl/researchdownloads/.

4.1 Datasets

The evaluation of correspondence matching is performed on two benchmark datasets (Mikolajczyk and Schmid [24] and Fischer et al. [13]), both of which provide the ground-truth homography between the reference image and the transformed image. The first dataset contains eight groups, each consisting of six image samples (48 images in total) with various transformations (rotation, viewpoint, scale, JPEG compression, illumination, and image blur). Considering the small scale of the dataset by Mikolajczyk and Schmid [24], the larger dataset of Fischer et al. [13] is also employed. It contains 16 groups of 26 images each (416 images in total), generated synthetically by applying six types of transformations (zoom, blur, illumination, rotation, perspective, and nonlinear).

4.2 Evaluation Criterion

As MFF is a framework for correspondence matching, we use Eq. (2) to establish correspondences. For the compared salient point methods, a KD-tree index is built and the Nearest Neighbor Distance Ratio (NNDR) is used as the matching strategy to find similar descriptors. Under NNDR, two points are considered a match if \(\parallel D_A-D_B\parallel /\parallel D_A-D_C\parallel < t\), where \(D_B\) is the first and \(D_C\) the second nearest neighbor of \(D_A\). The NNDR matching threshold t is set to 0.8 in the experiment.
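A minimal brute-force sketch of the NNDR test (the experiments use a KD-tree index instead, which changes only how the neighbors are found):

```python
import numpy as np

def nndr_match(desc_a, desc_b, t=0.8):
    """Return index pairs (i, j) where the NNDR test passes."""
    desc_b = np.asarray(desc_b)
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]       # first and second nearest neighbors
        if dists[j] / (dists[k] + 1e-12) < t:
            matches.append((i, int(j)))
    return matches
```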

To further determine whether a match is correct, we enforce a one-to-one constraint: a match is considered correct only if its matching point is geometrically the closest point within the defined pixel coordinate error. For two compared images I and \(I'\), let the set of all matches be:

$$\begin{aligned} M=\{p_i\leftrightarrow p_j'|m(p_i,p_j')\} \end{aligned}$$
(3)

where \(m(p_i,p_j')\) denotes that the two points satisfy the correspondence requirement. Note that different points in image I can be projected to the same point in image \(I'\) (many-to-one matches), even though only a single best match is returned for each point in the reference image. We therefore refine the matches to one-to-one by accepting only the \(p_i\) with the smallest matching distance \(d(p_i, p')\):

$$\begin{aligned} M_{refine}=\{p_k\leftrightarrow p'\in M|k=\arg \min _{i} d(p_i,p')\} \end{aligned}$$
(4)
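A minimal sketch of this refinement step, assuming matches are (i, j) index pairs and dist(i, j) returns the matching distance:

```python
def refine_one_to_one(matches, dist):
    """Eq. (4): among all matches sharing the same target point j,
    keep only the source point i with the smallest distance."""
    best = {}
    for i, j in matches:
        if j not in best or dist(i, j) < dist(best[j], j):
            best[j] = i
    return [(i, j) for j, i in best.items()]
```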

The final correct matches are then evaluated against the ground-truth homography:

$$\begin{aligned} correct\_match=\{p_i\leftrightarrow p_j'|D(H(p_i),p_j')<\varepsilon \} \end{aligned}$$
(5)

where \(D(H(p_i),p_j')\) is the position error between \(p_j'\) and the projection of \(p_i\) under the ground-truth homography H; in all cases, \(\varepsilon \) is set to 3 pixels.
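A minimal sketch of this correctness check, with H given as a 3 × 3 homography matrix:

```python
import numpy as np

def is_correct(H, p, p_prime, eps=3.0):
    """Eq. (5): project p with the ground-truth homography H and
    compare against p_prime within eps pixels."""
    q = H @ np.array([p[0], p[1], 1.0])
    q = q[:2] / q[2]                       # back to inhomogeneous coordinates
    return float(np.linalg.norm(q - np.asarray(p_prime))) < eps
```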

Following common practice in evaluation protocols, we use the number of correct matches as the criterion, i.e., the total number of correct correspondences between the two compared images.

5 Evaluation Results

In this section, we apply our MFF framework to the local features of SIFT, rootSIFT, and SURF, and present a detailed performance comparison on the two benchmark datasets.

Impact of multiple assignment size: We first analyse the impact of the MA size. Table 1 shows that increasing the MA size marginally improves matching accuracy on both datasets. Since a large MA size also introduces noise, the MA size is set to 2 in the experiments.

Table 1. The average number of correct matches under different MA size settings.

Evaluation results: We first evaluate the performance of each method on the dataset proposed by Mikolajczyk and Schmid [24]. The numbers of correct matches under perspective, scale, and rotation changes are shown in Fig. 4, and they clearly illustrate the effectiveness of the proposed MFF framework. Note that MFF-rootSIFT obtained the highest number of correct matches in all cases, and MFF-SURF also outperformed the original SURF method.

Fig. 4.

Evaluation results on the viewpoint, rotation and scale changes based on the dataset provided by Mikolajczyk and Schmid [24].

Fig. 5.

Evaluation results on the viewpoint and rotation transformation based on the dataset of Fischer et al. [13].

We then evaluate all approaches on the large-scale dataset designed by Fischer et al. [13], using the average number of correct matches to measure performance. The evaluation results under the two challenging transformations of viewpoint and rotation are shown in Fig. 5. A tendency similar to the results in Fig. 4 can be observed, which further demonstrates that MFF significantly increases matching accuracy under various transformations. Together, the results in Figs. 4 and 5 show that the proposed MFF framework is effective and significantly improves the accuracy of correspondence matching when combined with traditional salient point methods.

6 Conclusions

This paper proposes a novel MFF framework. It first constructs a retina-inspired patch-structure and locates the salient points in an image. The MFF then fuses the local descriptor of each salient point with the CNN feature extracted from the patch containing it. The experimental results demonstrate the effectiveness of the proposed framework, which yields higher correspondence matching accuracy under viewpoint, scale, and rotation changes.