1 Introduction

Template matching is the task of finding the position of a template image within a reference image, and it is one of the fundamental techniques in a broad variety of computer vision applications, such as pattern recognition [1, 2] and image mosaicking [3, 4]. Different from object detection, template matching does not learn the features of specific objects, and the template image may contain one or several objects. In general, there are two major classes of classic template matching methods [5]: feature-based methods and pixel-based methods. The key point of feature-based methods is to extract robust feature vectors, and many feature extraction algorithms have been proposed, including SUSAN [6], FAST [7], SIFT [8], SURF [9] and ORB [10]. Feature-based methods are resistant to illumination change and affine transformation, but it is difficult to extract robust feature vectors when the image is heavily corrupted. Pixel-based methods make use of all pixels instead of feature vectors to find the image patch that is most similar to the template image. These methods usually rely on a similarity measure, such as the sum of squared differences, normalized cross-correlation [11], increment sign correlation [12], selective correlation [13] or occlusion-free correlation [14]. Pixel-based methods are generally superior to feature-based methods under image noise and occlusion. However, classic matching methods cannot tackle complex transformations. Recently, many improved methods have been proposed to overcome real-life challenges. Dekel et al. [15] propose the best-buddies similarity measure, Talmi et al. [16] introduce the deformable diversity similarity measure, and Kat et al. [17] introduce a co-occurrence based similarity measure. Concurrently, researchers have utilized deep networks to compute the similarity of image patches. Han et al. [18] present a unified method named MatchNet, which consists of a deep convolutional network that extracts features from patches and a fully connected network that outputs the similarity between the extracted features. Zagoruyko et al. [19] also propose multiple neural network architectures to learn a general similarity function for comparing image patches.

In practical applications, the template image is often inevitably degraded by Gaussian blur, and the above methods cannot effectively handle this case. For blurred template matching, a straightforward approach is to first apply image deblurring [20, 21] to estimate the latent template image, and then perform template matching. With the help of image deblurring, this two-stage method relieves the effect of image blurring on matching, but it may suffer greatly from the deficiencies of image deblurring. To avoid this problem, Shao et al. [5] propose a joint image deblurring and matching method (JRM-DSR), which utilizes a sparse representation prior to exploit the correlation between deblurring and matching. The method achieves deblurring and matching simultaneously, and the two tasks benefit greatly from each other. However, as the template image becomes more blurred, the matching accuracy of JRM-DSR decreases dramatically. Besides, the optimization of JRM-DSR needs to solve for the sparse representation of a high-dimensional pixel vector, which results in slow matching speed. Moreover, once the reference image changes or the size of the template image changes, JRM-DSR has to reconstruct its image dictionary, which limits the robustness of the method.

Fig. 1. The framework of the cascaded network for blurred template matching. Given a blurred template image and a clear reference image, the coarse matching network searches for the small image in the reference image where the target matching position is located, and then the fine matching network calculates the similarity between the blurred template image and all image patches in the small image to determine the matching position.

In this paper, we propose a blurred template matching method based on a cascaded network. Adopting a coarse-to-fine matching strategy, the cascaded network combines a coarse matching network and a fine matching network, both of which are derived from the Siamese network. The framework of our method is shown in Fig. 1. Given a blurred template image and a clear reference image, the coarse matching network searches for a small image in the reference image where the target matching position is located, and then the fine matching network calculates the similarity between the blurred template image and all image patches in the small image; the matching position is then the position corresponding to the image patch with the highest similarity. In brief, the main contributions of this paper are as follows:

  1. We innovatively apply deep learning to blurred template matching and propose a new method based on a cascaded network. Different from conventional blurred template matching methods, the proposed method directly learns feature vectors and a similarity measure from training data.

  2. Extensive experiments have been conducted. The results demonstrate the effectiveness of the proposed method and show that it significantly outperforms the state of the art in terms of matching accuracy, speed and robustness.

The remainder of this paper is organized as follows. Section 2 describes the proposed method and details the architecture of the cascaded network. Section 3 explains how to generate the training data and discusses the objective functions. Section 4 presents the experimental results and analysis under different conditions. Finally, we conclude our work with a summary in Sect. 5.

2 Proposed Method and Network Architecture

Motivated by recent successes in learning features and similarity measures, we present a blurred template matching method based on a cascaded network, as shown in Fig. 1. The cascaded network contains a coarse matching network and a fine matching network. Given a blurred template image and a clear reference image, we first utilize the coarse matching network to search for a small image in the reference image where the target matching position is located. Afterwards, image patches of the same size as the template image are extracted from the small image, and the fine matching network calculates the similarity between the blurred template image and all image patches. Finally, we obtain the matching position in the reference image as the position corresponding to the image patch with the highest similarity. Generally speaking, the coarse matching network accelerates matching by reducing the search region in the reference image, and the fine matching network ensures matching accuracy by determining the final matching position. Our method innovatively applies a deep network to blurred template matching and learns more robust feature vectors and a more accurate similarity measure, thus greatly improving the matching accuracy. The computing time of the cascaded network is short, and our method requires no preparation when the template image or reference image changes, so the matching speed is very fast. Besides, our method is robust to the sizes of the template image and reference image, and it directly outputs the matching position. Next, we detail the architectures of the coarse matching network and the fine matching network.

Fig. 2. The architecture of the coarse matching network, which combines a fully-convolutional Siamese module and a cross-correlation layer. Given a blurred template image and a clear reference image, the coarse matching network searches for the small image in the reference image where the target matching position is located.

Table 1. Layer parameters of the fully-convolutional Siamese module. Layer type: C denotes convolution, MP denotes max-pooling. The output dimensions of the template image and reference image are given as \(height \times width \times channel\).

2.1 Coarse Matching Network

For blurred template matching, one could utilize various types of Siamese network [18, 19] to compute the similarity between the blurred template image and all image patches extracted from the reference image, where the size of each image patch is the same as that of the template image. However, when the reference image is large, the number of image patches is very large, and the time complexity of this straightforward method is too high to meet the speed requirement.

Inspired by the application of the Siamese network in object tracking [22], we propose a coarse matching network to search for the small image in the reference image where the target matching position is located. The architecture of the coarse matching network is shown in Fig. 2. In short, the coarse matching network mainly contains a fully-convolutional Siamese module and a cross-correlation layer. Given a blurred template image and a reference image, the fully-convolutional Siamese module first extracts feature maps from each of them. The Siamese module is influenced by VGGNet [23] and has only convolution layers and max-pooling layers, so it can process images of different sizes. Assuming that the template image size is \(50 \times 50\) and the reference image size is \(350 \times 350\), the layer parameters of the Siamese module are listed in Table 1. Afterwards, the cross-correlation layer combines the feature maps of the template image and the reference image. Specifically, the cross-correlation layer uses the feature maps of the template image as kernels, and performs convolution on each channel of the reference image's feature maps, as described in Eq. 1.

$$\begin{aligned} \mathbf{Y}_i = \mathbf{B}_i * \mathbf{I}_i, \quad i=1,2,\ldots ,128 \end{aligned}$$
(1)

where \(\mathbf{B}\) denotes the feature maps of the template image, \(\mathbf{I}\) denotes the feature maps of the reference image, and \(\mathbf{Y}\) is the output. The operation of the cross-correlation layer is mathematically equivalent to using the inner product to independently evaluate the template image against each image patch in the reference image. Finally, we obtain a heat map from the output \(\mathbf{Y}\) via a \(1 \times 1\) convolution layer. The values of the heat map range from \(-1\) to \(+1\), and pixel values greater than 0 indicate the small image in the reference image where the target matching position is located. Besides, we pad the convolution and pooling layers so that the output height and width are the same as the input, and use ReLU as the non-linearity for the convolution layers.
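For concreteness, the per-channel cross-correlation of Eq. 1 can be expressed as a depthwise convolution, since each channel of the template's feature maps acts as a kernel over the corresponding channel of the reference's feature maps. The following is a minimal TensorFlow sketch, assuming 128-channel feature maps in NHWC layout; the function and variable names are ours, not the paper's.

```python
import tensorflow as tf

def cross_correlation_layer(template_feats, reference_feats):
    """Eq. 1: correlate each channel of the reference feature maps with the
    corresponding channel of the template feature maps.

    template_feats:  [th, tw, 128] feature maps of the template image (B)
    reference_feats: [1, rh, rw, 128] feature maps of the reference image (I)
    returns:         [1, rh, rw, 128] per-channel responses (Y)
    """
    th, tw, c = template_feats.shape
    # depthwise_conv2d expects kernels of shape [h, w, in_channels, multiplier];
    # note that TF "convolution" is in fact cross-correlation (no kernel flip),
    # which matches the intent of this layer.
    kernel = tf.reshape(template_feats, [th, tw, c, 1])
    return tf.nn.depthwise_conv2d(
        reference_feats, kernel, strides=[1, 1, 1, 1], padding="SAME")
```

A \(1 \times 1\) convolution over the 128 response channels then produces the single-channel heat map described above.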

Fig. 3. The architecture of the fine matching network, which contains a Siamese module and a metric module. Based on the matching result of the coarse matching network, the fine matching network calculates the similarity between the blurred template image and all image patches in the small image, and then outputs the matching position in the reference image.

2.2 Fine Matching Network

Given the small image in the reference image where the target matching position is located, we propose a fine matching network to determine the final matching position. From the small image, we first extract image patches of the same size as the template image. Afterwards, we input the template image and one image patch into the fine matching network at a time, and the network calculates the similarity of the image pair. The matching position is then the position in the reference image corresponding to the image patch with the highest similarity.
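This exhaustive search over the small image amounts to a simple loop. Below is a minimal sketch, assuming fine_net is a callable that maps an image pair to a scalar similarity; the helper name and looping scheme are ours.

```python
import numpy as np

def fine_match(template, region, fine_net):
    """Score every template-sized patch in the coarse search region and
    return the top-left position of the most similar one."""
    th, tw = template.shape[:2]
    rh, rw = region.shape[:2]
    best_pos, best_sim = (0, 0), -np.inf
    for y in range(rh - th + 1):
        for x in range(rw - tw + 1):
            sim = fine_net(template, region[y:y + th, x:x + tw])
            if sim > best_sim:
                best_pos, best_sim = (y, x), sim
    return best_pos, best_sim
```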

The architecture of the fine matching network is shown in Fig. 3, and the layer parameters are listed in Table 2. As we can see, the fine matching network contains a Siamese module and a metric module. The Siamese module consists of convolution, max-pooling, three MobileNetV2 blocks [24] and SPP pooling [25]. Using the MobileNetV2 blocks, the Siamese module greatly reduces computation and parameters. With the help of SPP pooling, it extracts fixed-dimension features from images of different sizes, so we obtain two features of 640 dimensions each. Next, we concatenate the two features and pass them through the metric module, which consists of two fully-connected layers and outputs the similarity of the two images. We also utilize ReLU as the nonlinear activation function and pad so that the output size is the same as the input. Different from previous works [18, 19], the proposed fine matching network calculates similarity based on the pixel deviation of the two images: the smaller the deviation, the higher the similarity. Moreover, the fine matching network is resistant to image blurring.
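As a concrete illustration, the metric module can be sketched as follows in TensorFlow, taking the two 640-dimensional SPP features as input. The hidden width of the first fully-connected layer is our assumption, since the paper only specifies two fully-connected layers and a similarity output.

```python
import tensorflow as tf

def metric_module(feat_a, feat_b, hidden_units=512):
    """Concatenate the two 640-d SPP features and map them to a similarity
    score in [0, 1] through two fully-connected layers."""
    x = tf.concat([feat_a, feat_b], axis=-1)                  # [batch, 1280]
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    return tf.keras.layers.Dense(1, activation="sigmoid")(x)  # similarity p
```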

Table 2. Layer parameters of the fine matching network, where the size of the blurred template image is \(50 \times 50\). Layer type: C denotes convolution, MP denotes max-pooling, MB denotes MobileNetV2 block, SPP denotes SPP pooling, FC denotes fully-connected.

3 Training

In this section, we first discuss the training sets of the coarse and fine matching networks and then describe the objective functions.

Each training sample for the coarse matching network consists of a blurred template image, a reference image and a heat map. However, there is no standard dataset for training the network. Therefore, we choose 36000 images from the MIT Places2 dataset [26] as clear reference images. For each reference image, we generate its blurred version with a Gaussian blur kernel, and then randomly select a small image of random size from each blurred reference image as the blurred template image. Based on the blurred template image and the clear reference image, we construct the corresponding heat map. In the heat map, a pixel value of 1 indicates the position of the template image in the reference image, and the pixel value is \(-1\) otherwise. Besides, the size of the heat map changes with the reference image.
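A minimal sketch of the ground-truth heat map construction follows; whether a single pixel or a small region is labeled positive is our assumption, as is the function name.

```python
import numpy as np

def make_heat_map(height, width, cx, cy):
    """Ground truth for the coarse matching network: +1 at the template's
    true position in the reference image, -1 everywhere else."""
    g = -np.ones((height, width), dtype=np.float32)
    g[cy, cx] = 1.0
    return g
```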

For the fine matching network, each training sample consists of a blurred template image, a clear image of the same size as the blurred template image, and a label. When the two images correspond to the same position, the label is 1; otherwise the label is 0. Given the 36000 images selected from the MIT Places2 dataset, we also generate blurred versions of each reference image with Gaussian blur kernels whose standard deviations range from 1 to 5. Afterwards, a position (x, y) in the reference image and an image size (w, h) are randomly selected. From the blurred reference image, we extract the blurred template image \(I_1\) with (x, y) as the center coordinate and (w, h) as the image size. From the clear reference image, we extract a clear image \(I_2\) with the same center coordinate and image size, so that \((I_1, I_2, 1)\) is a positive training sample. We extract another clear image \(I_3\), whose center coordinate is near that of \(I_2\), and form the negative training sample \((I_1, I_3, 0)\). In this way, we can obtain a large amount of training data.
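The sampling procedure above can be sketched as follows, using OpenCV's Gaussian blur and the \((6\sigma +1) \times (6\sigma +1)\) kernel size stated in Sect. 4; the fixed patch size and the offset range for the negative sample are illustrative assumptions.

```python
import numpy as np
import cv2

def make_training_pair(reference, sigma, rng, cw=50, ch=50):
    """Generate one positive and one negative sample for the fine matching
    network from a single clear reference image."""
    ksize = 6 * sigma + 1  # odd for integer sigma, as in Sect. 4
    blurred = cv2.GaussianBlur(reference, (ksize, ksize), sigma)
    h, w = reference.shape[:2]
    # random center far enough from the border to fit a (cw, ch) crop
    x = int(rng.integers(cw // 2, w - cw // 2))
    y = int(rng.integers(ch // 2, h - ch // 2))
    crop = lambda img, cx, cy: img[cy - ch // 2:cy + ch // 2,
                                   cx - cw // 2:cx + cw // 2]
    I1 = crop(blurred, x, y)      # blurred template
    I2 = crop(reference, x, y)    # clear patch at the same position
    # nearby (but not identical) center for the negative sample
    dx, dy = rng.integers(3, 10, size=2)
    x3 = int(np.clip(x + dx, cw // 2, w - cw // 2))
    y3 = int(np.clip(y + dy, ch // 2, h - ch // 2))
    I3 = crop(reference, x3, y3)
    return (I1, I2, 1), (I1, I3, 0)
```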

Given the training sets, we train the coarse matching network and fine matching network in a strongly supervised manner. For the coarse matching network, the objective encourages the predicted heat map to be as close to the ground truth as possible, so the logistic loss is adopted:

$$\begin{aligned} L = \frac{1}{PQ}\sum \limits _{i = 0}^{P-1}\sum \limits _{j = 0}^{Q-1}\log \left( 1+\exp \left( -S(i,j)\,G(i,j)\right) \right) \end{aligned}$$
(2)

where S(i, j) denotes the value of the predicted heat map at position (i, j), G(i, j) denotes the corresponding value of the ground truth, and P and Q are the height and width of the heat map.
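In TensorFlow, Eq. 2 reduces to a one-liner; the softplus form below is a numerically stable rewriting of \(\log (1+\exp (\cdot ))\) and is our implementation choice.

```python
import tensorflow as tf

def coarse_loss(pred_heat_map, gt_heat_map):
    """Eq. 2: logistic loss averaged over the P x Q heat map, where the
    ground truth takes values in {-1, +1}."""
    # softplus(z) = log(1 + exp(z)), evaluated stably
    return tf.reduce_mean(tf.math.softplus(-pred_heat_map * gt_heat_map))
```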

In the small image in the reference image where the target matching position is located, only one image patch corresponds to the target matching position, yet many image patches are close to the target image patch. To improve the matching accuracy of the fine matching network on such hard examples, we adopt the focal loss function [27]

$$\begin{aligned} L = -y(1-p)^\gamma \log (p) - (1-y)p^\gamma \log (1-p) \end{aligned}$$
(3)

where y is the training label, p is the predicted similarity, and \(\gamma \) is the focusing weight, which we set to \(\gamma = 2\).
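A minimal sketch of Eq. 3 follows; the clipping constant eps is our addition to guard against \(\log (0)\).

```python
import tensorflow as tf

def focal_loss(y, p, gamma=2.0, eps=1e-7):
    """Eq. 3: binary focal loss, where y is the 0/1 label and p is the
    predicted similarity."""
    p = tf.clip_by_value(p, eps, 1.0 - eps)
    return (-y * (1.0 - p) ** gamma * tf.math.log(p)
            - (1.0 - y) * p ** gamma * tf.math.log(1.0 - p))
```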

For both training and testing, we use an Nvidia GTX 1080 GPU with TensorFlow and the cuDNN library. Adam with an initial learning rate of 0.0001 and a weight decay of 0.0003 is adopted to train the coarse matching network and the fine matching network. The mini-batch sizes are 8 and 128 for the two networks, respectively.

4 Experiments

In this section, we first describe the test image dataset. Then, experiments are carried out, and we compare our approach with other methods in terms of matching accuracy, speed and robustness.

In the experiments, the test reference images are six aerial images. We apply a Gaussian blur kernel to each reference image and randomly select 100 small images as the blurred template images. The sizes of the reference images and template images are \(600 \times 600\) and \(50 \times 50\), respectively. The size of the Gaussian blur kernel is \((6\sigma +1) \times (6\sigma +1)\), where \(\sigma \) denotes the standard deviation and ranges from 1 to 5. Each test case thus consists of a reference image, a blurred template image and the corresponding ground-truth matching position. To quantify the matching accuracy, we compute the Manhattan distance (MD) between the predicted matching position and the ground truth, and report the percentage of test samples matched accurately under different MD thresholds.
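This metric can be computed as below, a small sketch assuming the predicted and ground-truth positions are stored as N x 2 integer arrays; the function name is ours.

```python
import numpy as np

def accuracy_at_md(pred, gt, md=5):
    """Percentage of test samples whose predicted matching position lies
    within Manhattan distance md of the ground truth."""
    d = np.abs(pred - gt).sum(axis=1)   # |dx| + |dy| per sample
    return 100.0 * (d <= md).mean()
```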

4.1 Matching Accuracy

We apply our method to blurred template matching and examine its performance. The compared methods are as follows: (1) NCC [11]: template matching based on the normalized correlation coefficient; (2) DNCC: image deblurring is used to recover the latent template image, followed by NCC; (3) JRM-DSR [5]: joint image deblurring and matching.

When the standard deviation of the Gaussian blur kernel is 3, the matching accuracy on the test dataset is listed in Table 3. The results show that the matching accuracy of DNCC is lower than that of NCC, which means that image deblurring alone does not help matching. JRM-DSR performs better than NCC and DNCC because it combines image deblurring and matching. Our method clearly has the highest matching accuracy: the accuracy \(P_{md \le 5}\) of the other methods is 66.67%, 65.33% and 91.00%, respectively, while our method achieves 100.00% under the same conditions. In brief, our method significantly outperforms NCC, DNCC and JRM-DSR in terms of matching accuracy.

Table 3. The matching accuracy of different methods, where the standard deviation of the Gaussian blur kernel is 3.

4.2 Matching Speed

We also examine the matching speed of the proposed method and compare it with other methods. In the experiment, we record the preparation time, computing time and total time of each method. The matching speeds of the different methods are listed in Table 4. We can see that JRM-DSR takes a long time to construct the image dictionary, whereas the other methods do not require any preparation. The total time of NCC is the shortest, followed by our method, but the matching accuracy of NCC is low. Moreover, the total times of DNCC and JRM-DSR are 4.52 and 1026.67 times that of our method, respectively. Therefore, while ensuring the highest matching accuracy, our method is also very fast.

Table 4. The matching speed of different methods. The preparation time of JRM-DSR is the time spent constructing the image dictionary. The computing time is the time from the start of matching to the end of matching, and the total time is the sum of preparation time and computing time.

4.3 Robustness Analysis

Influence of the Standard Deviation of the Gaussian Blur Kernel. Image blurring has a great influence on the matching accuracy: the greater the standard deviation of the Gaussian blur kernel, the more difficult the matching. Therefore, experiments are carried out to demonstrate that our approach is robust to the standard deviation (\(\sigma \)) of the Gaussian blur kernel. In the experiments, \(\sigma \) ranges from 1 to 5, and the results are shown in Table 5. When \(\sigma \) changes from 1 to 5, the matching accuracy of NCC, DNCC and JRM-DSR decreases by 67.33%, 76.67% and 51.50%, respectively, whereas the matching accuracy of our method remains 100%. Consequently, our method is more robust to the standard deviation of the Gaussian blur kernel.

Table 5. Image matching results comparison in terms of the standard deviation (\(\sigma \)) of the Gaussian blur kernel, where the matching accuracy is reported as \(P_{md \le 5}\).
Fig. 4. Image matching results comparison in terms of scale variation, where the standard deviation of the Gaussian blur kernel is 3. (a) The size of the template image is \(40 \times 40\); (b) the size of the template image is \(60 \times 60\).

Influence of Scale Variation. This experiment analyses the robustness of our method to the template image size, and the matching accuracy of the different methods is shown in Fig. 4. The results show that our method achieves the highest matching accuracy regardless of the size of the template image, while the matching accuracy of NCC and JRM-DSR varies greatly. In particular, when the template image size changes, JRM-DSR must spend a lot of time reconstructing its image dictionary or feature dictionary, whereas our method requires no preparation or retraining and can directly process different reference images and template images. Based on the above experimental results and analysis, the robustness of our method to scale variation is clearly better than that of the other methods.

5 Conclusions

In this paper, we have presented a blurred template matching method based on a cascaded network. Our method utilizes a coarse matching network to search for the small image where the target matching position is located, and then uses a fine matching network to determine the final matching position in the reference image. The experimental results and analysis demonstrate its effectiveness for blurred template matching, and our method significantly outperforms the state of the art in terms of matching accuracy, speed and robustness.