Abstract
Template matching is widely used in computer vision applications, but most matching methods assume ideal images without real-world degradations such as Gaussian blur. Traditional methods for blurred template matching either first deblur the image and then match the template against the recovered image, or jointly solve image deblurring and matching based on a sparse representation prior. However, these methods often match poorly and run slowly. In this paper, we propose a blurred template matching method based on a cascaded network, which combines a coarse matching network and a fine matching network. The coarse matching network searches the reference image for a small image that contains the target matching position. The fine matching network then calculates the similarity between the blurred template image and every image patch in the small image, and the matching position is the position of the image patch with the highest similarity. Extensive experiments demonstrate that our method significantly outperforms the state of the art in accuracy, speed and robustness.
1 Introduction
Template matching finds the position of a template image in a reference image, and it is one of the fundamental techniques in a broad variety of computer vision applications, such as pattern recognition [1, 2] and image mosaicking [3, 4]. Unlike object detection, template matching does not learn the features of specific objects, and the template image may contain one, two or several objects. In general, there are two major classes of classic template matching methods [5]: feature-based methods and pixel-based methods. The key point of feature-based methods is to extract robust feature vectors, and many feature extraction algorithms have been proposed, including SUSAN [6], FAST [7], SIFT [8], SURF [9] and ORB [10]. Feature-based methods are resistant to illumination change and affine transformation, but it is difficult to extract robust feature vectors when the image is heavily corrupted. Pixel-based methods use all pixels instead of feature vectors to find the image patch that is most similar to the template image. These methods rely on different similarity measures, such as sum of squared differences, normalized cross correlation [11], increment sign correlation [12], selective correlation [13] and occlusion-free correlation [14]. Usually, pixel-based methods are superior to feature-based methods under image noise and occlusion. However, classic matching methods cannot handle complex transformations. Recently, many improved methods have been proposed to overcome real-life challenges. Dekel et al. [15] propose the best-buddies similarity measure, Talmi et al. [16] introduce the deformable diversity similarity measure and Kat et al. [17] introduce a co-occurrence based similarity measure. Concurrently, researchers have used deep networks to compute the similarity of image patches. Han et al. [18] present a unified method named MatchNet, which consists of a deep convolutional network that extracts features from patches and a fully connected network that outputs a similarity between the extracted features. Zagoruyko et al. [19] also propose multiple neural network architectures to learn a general similarity function for comparing image patches.
In practical applications, the template image is inevitably degraded by Gaussian blur, and the above methods cannot handle this case effectively. For blurred template matching, a straightforward method is to first apply image deblurring [20, 21] to estimate the latent template image and then perform template matching. With the help of image deblurring, this two-stage method reduces the effect of blur on matching, but it may suffer greatly from the deficiencies of the deblurring step. To avoid the problem, Shao et al. [5] propose a joint image deblurring and matching method (JRM-DSR), which uses a sparse representation prior to exploit the correlation between deblurring and matching. The method performs deblurring and matching simultaneously, and the two tasks benefit from each other. However, as the template image becomes more blurred, the matching accuracy of JRM-DSR decreases dramatically. Besides, the optimization of JRM-DSR has to solve the sparse representation of a high-dimensional pixel vector, which results in slow matching. Moreover, once the reference image or the size of the template image changes, JRM-DSR has to reconstruct the image dictionary, which limits the robustness of the method.
In this paper, we propose a blurred template matching method based on a cascaded network. Adopting a coarse-to-fine matching strategy, the cascaded network combines a coarse matching network and a fine matching network, both of which are derived from the Siamese network. The framework of our method is shown in Fig. 1. Given a blurred template image and a clear reference image, the coarse matching network searches the reference image for a small image that contains the target matching position. The fine matching network then calculates the similarity between the blurred template image and every image patch in the small image, and the matching position is the position of the image patch with the highest similarity. In brief, the main contributions of this paper are as follows:
(1) We apply deep learning to blurred template matching and propose a new method based on a cascaded network. Unlike conventional blurred template matching methods, the proposed method learns feature vectors and the similarity measure directly from training data.
(2) Extensive experiments have been conducted. The results demonstrate the effectiveness of the proposed method and show that it significantly outperforms the state of the art in terms of matching accuracy, speed and robustness.
The remainder of this paper is organized as follows. Section 2 describes the proposed method and details the architecture of the cascaded network. Section 3 explains how the training data are generated and discusses the objective functions. Section 4 presents the experimental results and analysis under different conditions. Finally, Sect. 5 concludes the paper.
2 Proposed Method and Network Architecture
Motivated by recent successes in learning features and similarity measures, we present a blurred template matching method based on a cascaded network, as shown in Fig. 1. The cascaded network contains a coarse matching network and a fine matching network. Given a blurred template image and a clear reference image, we first use the coarse matching network to search the reference image for a small image that contains the target matching position. Afterwards, image patches of the same size as the template image are extracted from the small image, and the fine matching network calculates the similarity between the blurred template image and each image patch. Finally, the matching position in the reference image is the position of the image patch with the highest similarity. In short, the coarse matching network speeds up matching by reducing the search region in the reference image, and the fine matching network ensures matching accuracy by determining the matching position. Because the networks learn more robust feature vectors and a more accurate similarity measure, our method greatly improves the matching accuracy. The computing time of the cascaded network is short, and our method needs no preparation when the template image or the reference image changes, so matching is fast. Besides, our method is robust to the sizes of the template image and the reference image, and it outputs the matching position directly. Next, we detail the architectures of the coarse matching network and the fine matching network.
2.1 Coarse Matching Network
For blurred template matching, we could use various types of Siamese network [18, 19] to compute the similarity between the blurred template image and every image patch extracted from the reference image, where each image patch has the same size as the template image. However, when the reference image is large, the number of image patches becomes very large, and the time complexity of this straightforward method is too high to meet the speed requirement.
Inspired by the application of the Siamese network to object tracking [22], we propose a coarse matching network to search the reference image for the small image that contains the target matching position. The architecture of the coarse matching network is shown in Fig. 2. In short, the coarse matching network mainly contains a fully-convolutional Siamese module and a cross-correlation layer. Given a blurred template image and a reference image, the fully-convolutional Siamese module first extracts the feature maps of each. The Siamese module is influenced by VGGNet [23] and has only convolution layers and max-pooling layers, so it can process images of different sizes. Assuming that the template image size is \(50 \times 50\) and the reference image size is \(350 \times 350\), the layer parameters of the Siamese module are listed in Table 1. Afterwards, the cross-correlation layer combines the feature maps of the template image and the reference image. Specifically, the cross-correlation layer uses the feature maps of the template image as kernels and convolves them with each channel of the reference image's feature maps, as described in Eq. 1.
In Eq. 1, \(\mathbf{B}\) denotes the feature maps of the template image, \(\mathbf{I}\) denotes the feature maps of the reference image, and \(\mathbf{Y}\) is the output. The cross-correlation layer is mathematically equivalent to using the inner product to evaluate the template image against each image patch in the reference image independently. Finally, we obtain a heat map from the output \(\mathbf{Y}\) through a \(1 \times 1\) convolution layer. The values of the heat map range from \(-1\) to \(+1\), and pixels with values greater than 0 mark the small image in the reference image that contains the target matching position. Besides, we pad the convolution and pooling layers so that the output height and width are the same as those of the input, and we use ReLU as the non-linearity for the convolution layers.
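As an illustration of the inner-product view of the cross-correlation layer, the following NumPy sketch slides the template feature maps over the reference feature maps channel by channel. It is a minimal sketch, not the authors' implementation: the function name `cross_correlate`, the "valid" (unpadded) output size and the plain Python loops are assumptions made for readability, whereas the paper pads its layers and follows the response with a \(1 \times 1\) convolution.

```python
import numpy as np


def cross_correlate(ref_feat, tmpl_feat):
    """Per-channel cross-correlation of template features over reference features.

    ref_feat:  (C, H, W) feature maps of the reference image (I in the text).
    tmpl_feat: (C, h, w) feature maps of the template image (B in the text).
    Returns a (C, H - h + 1, W - w + 1) response (Y in the text).
    Assumption: no padding, so the output is smaller than the input.
    """
    C, H, W = ref_feat.shape
    c, h, w = tmpl_feat.shape
    if c != C:
        raise ValueError("template and reference must have the same channel count")
    out = np.zeros((C, H - h + 1, W - w + 1), dtype=np.float64)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            window = ref_feat[:, y:y + h, x:x + w]
            # Inner product between the template and this window, per channel.
            out[:, y, x] = np.sum(window * tmpl_feat, axis=(1, 2))
    return out
```

Summing the response over channels gives a single score map whose peak marks the window that agrees best with the template features; in a trained network a learned \(1 \times 1\) convolution would take the place of that plain sum.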
2.2 Fine Matching Network
Given the small image that contains the target matching position, we propose a fine matching network to determine the final matching position. From the small image, we first extract the image patches of the same size as the template image. Afterwards, we feed the template image and one image patch at a time into the fine matching network, which calculates the similarity of the image pair. The matching position is then the position in the reference image of the image patch with the highest similarity.
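The patch-by-patch search described above can be sketched as follows. The learned fine matching network is replaced here by a pluggable `similarity` callable, and `neg_mean_abs_diff` is only a hypothetical stand-in so that the sketch runs without a trained model; the `stride` parameter is likewise an assumption.

```python
import numpy as np


def fine_search(template, region, similarity, stride=1):
    """Score every template-sized patch in `region` and return the best one.

    Returns (row, col, score), where (row, col) is the top-left corner of the
    best patch in the coordinates of `region`.
    """
    th, tw = template.shape[:2]
    rh, rw = region.shape[:2]
    best = (None, None, -np.inf)
    for r in range(0, rh - th + 1, stride):
        for c in range(0, rw - tw + 1, stride):
            patch = region[r:r + th, c:c + tw]
            score = similarity(template, patch)
            if score > best[2]:
                best = (r, c, score)
    return best


def neg_mean_abs_diff(a, b):
    # Hypothetical stand-in for the learned similarity: higher means more similar.
    return -float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```

To express the result in reference-image coordinates, add the top-left offset of the small image within the reference image to the returned `(row, col)`.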
The architecture of the fine matching network is shown in Fig. 3, and the layer parameters are listed in Table 2. The fine matching network contains a Siamese module and a metric module. The Siamese module consists of convolution, max-pooling, three MobileNetV2 blocks [24] and SPP pooling [25]. The MobileNetV2 blocks considerably reduce the computation and the number of parameters of the Siamese module. With SPP pooling, the module extracts features of fixed dimension from images of different sizes, so we obtain two features with 640 dimensions each. Next, we concatenate the two features and pass them through the metric module, which consists of two fully-connected layers and outputs the similarity of the two images. We also use ReLU as the nonlinear activation function and pad the layers so that the output size is the same as the input. Unlike the works in [18, 19], the proposed fine matching network calculates similarity based on the pixel deviation between the two images: the smaller the deviation, the higher the similarity. Moreover, the fine matching network is resistant to image blurring.
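To illustrate why SPP pooling yields a fixed-length feature for inputs of any size, here is a minimal NumPy sketch of spatial pyramid max pooling. The pyramid levels `(1, 2)` and the 128-channel input used in the usage note are assumptions chosen only because 128 × (1 + 4) = 640 matches the dimension quoted above; the actual configuration is the one in Table 2.

```python
import numpy as np


def spp_max_pool(feat, levels=(1, 2)):
    """Spatial pyramid max pooling.

    feat: (C, H, W) feature maps. For each level n the map is split into an
    n x n grid and max-pooled per cell, so the output length is
    C * sum(n * n for n in levels) regardless of H and W.
    """
    C, H, W = feat.shape
    if H < max(levels) or W < max(levels):
        raise ValueError("feature map is smaller than the finest pyramid level")
    pooled = []
    for n in levels:
        # Integer bin edges that cover the whole map even if H or W
        # is not divisible by n.
        row_edges = [(k * H) // n for k in range(n + 1)]
        col_edges = [(k * W) // n for k in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = feat[:, row_edges[i]:row_edges[i + 1],
                            col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```

For example, feature maps of shape `(128, 7, 9)` and `(128, 13, 4)` both pool to a vector of length 640 under these assumed settings.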
3 Training
In this section, we first discuss the training sets of the coarse and fine matching networks and then describe the objective functions.
Each training sample for the coarse matching network consists of a blurred template image, a reference image and a heat map. However, there is no standard dataset for training the network. Therefore, we choose 36000 images from the MIT Places2 dataset [26] as clear reference images. For each reference image, we generate a blurred version with a Gaussian blur kernel, and then randomly crop a small image of random size from the blurred reference image as the blurred template image. From the blurred template image and the clear reference image, we construct the corresponding heat map. In the heat map, a pixel value of 1 indicates the position of the template image in the reference image, and all other pixels have the value \(-1\). The size of the heat map changes with the size of the reference image.
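A ground-truth heat map of this kind could be built as in the sketch below. It assumes that the whole footprint of the template crop is labeled \(+1\); the text does not say whether the footprint or only its center is marked, so this choice, like the function name `make_heat_map`, is an assumption.

```python
import numpy as np


def make_heat_map(ref_shape, top, left, tmpl_h, tmpl_w):
    """Ground-truth heat map: +1 inside the template footprint, -1 elsewhere.

    ref_shape:   (H, W) of the reference image; the heat map has the same size.
    (top, left): top-left corner of the template crop in the reference image.
    """
    heat = -np.ones(ref_shape, dtype=np.float32)
    heat[top:top + tmpl_h, left:left + tmpl_w] = 1.0
    return heat
```

For a \(350 \times 350\) reference image and a \(50 \times 50\) template cropped at row 100, column 120, `make_heat_map((350, 350), 100, 120, 50, 50)` marks 2500 pixels with \(+1\).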
For the fine matching network, each training sample consists of a blurred template image, a clear image of the same size as the blurred template image, and a label. When the two images show the same content, the label is 1; otherwise the label is 0. For the 36000 images selected from the MIT Places2 dataset, we again generate a blurred version of each reference image with a Gaussian blur kernel whose standard deviation ranges from 1 to 5. Afterwards, a position (x, y) in the reference image and an image size (w, h) are randomly selected. From the blurred reference image, we extract the blurred template image \({{I_1}}\) with (x, y) as the center coordinate and (w, h) as the image size. From the clear reference image, we extract a clear image \({{I_2}}\) with the same center coordinate and image size, so \((I_1, I_2, 1)\) is a positive training sample. We extract another clear image \({{I_3}}\) whose center coordinate is close to that of \({{I_2}}\), and \((I_1, I_3, 0)\) forms a negative training sample. In this way, we can generate a large amount of training data.
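The sampling procedure above might be sketched as follows for a grayscale image. Several details are assumptions rather than the authors' settings: the kernel size \(6\sigma + 1\) is borrowed from the test setup in Sect. 4, and the edge padding, the template size range `size_range` and the maximum offset `max_shift` of the negative patch are hypothetical.

```python
import numpy as np


def gaussian_kernel_1d(sigma):
    radius = int(round(3 * sigma))  # kernel length 6 * sigma + 1 (assumed)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D image with edge padding."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img.astype(np.float64), r, mode="edge")
    # Horizontal pass, then vertical pass; "valid" restores the input size.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)


def crop(img, cy, cx, h, w):
    top, left = cy - h // 2, cx - w // 2
    return img[top:top + h, left:left + w]


def sample_pairs(clear_ref, sigma, rng, size_range=(30, 60), max_shift=5):
    """Return one positive sample (I1, I2, 1) and one negative sample (I1, I3, 0)."""
    H, W = clear_ref.shape
    blurred_ref = gaussian_blur(clear_ref, sigma)
    h = int(rng.integers(size_range[0], size_range[1] + 1))
    w = int(rng.integers(size_range[0], size_range[1] + 1))
    m = max_shift
    # Keep the crop and its shifted copy inside the image.
    cy = int(rng.integers(h // 2 + m, H - (h - h // 2) - m + 1))
    cx = int(rng.integers(w // 2 + m, W - (w - w // 2) - m + 1))
    i1 = crop(blurred_ref, cy, cx, h, w)
    i2 = crop(clear_ref, cy, cx, h, w)
    dy, dx = 0, 0
    while dy == 0 and dx == 0:  # the negative patch must be shifted
        dy, dx = (int(v) for v in rng.integers(-m, m + 1, size=2))
    i3 = crop(clear_ref, cy + dy, cx + dx, h, w)
    return (i1, i2, 1), (i1, i3, 0)
```

Drawing the negative patch from a nearby position produces hard negatives that look almost like the positive patch, which is the situation the fine matching network faces at test time.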
Given the training set, we train the coarse matching network and the fine matching network in a fully supervised manner. For the coarse matching network, the objective function should make the predicted heat map as close to the ground truth as possible, so the logistic loss is adopted.
Here, S(i, j) denotes the value of the predicted heat map at position (i, j), and G(i, j) denotes the corresponding ground-truth value.
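One common form of the logistic loss for labels in \(\{-1, +1\}\), used for example in the tracking work [22], is \(\log (1 + e^{-G(i, j) S(i, j)})\) averaged over all positions. The sketch below assumes that form; the exact expression and normalization used by the authors are not recoverable from this text.

```python
import numpy as np


def logistic_loss(pred, target):
    """Mean of log(1 + exp(-G * S)) over all heat-map positions.

    pred:   predicted heat map S (any real values).
    target: ground-truth heat map G with entries in {-1, +1}.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    return float(np.mean(np.logaddexp(0.0, -target * pred)))
```

The loss is small when the prediction and the ground truth share the same sign with a large margin, and it grows roughly linearly when they disagree.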
In the small image that contains the target matching position, only one image patch corresponds to the target matching position. However, many image patches are close to the target image patch. To improve the matching accuracy of the fine matching network, we adopt the focal loss function [27].
Here, y is the training label, p is the predicted similarity, and \(\gamma \) is the focusing parameter, which we set to \(\gamma = 2\).
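For reference, the binary focal loss of [27] without the optional class-balancing weight is \(-(1 - p_t)^{\gamma} \log (p_t)\), where \(p_t = p\) if \(y = 1\) and \(p_t = 1 - p\) otherwise. The sketch below assumes this form and averages over the batch; the clipping constant `eps` is an added numerical safeguard, not part of the original definition.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over the batch.

    p: predicted similarity in [0, 1]; y: label in {0, 1}.
    """
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    y = np.asarray(y, dtype=np.float64)
    # p_t is the probability assigned to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With `gamma=0` the function reduces to the ordinary cross-entropy; with \(\gamma = 2\), well-classified pairs contribute far less, so training concentrates on the near-duplicate patches that are hard to tell apart.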
In the training and test stages, we use TensorFlow with the cuDNN library on an Nvidia GTX 1080 GPU. Adam with an initial learning rate of 0.0001 and a weight decay of 0.0003 is adopted to train the coarse matching network and the fine matching network. The mini-batch sizes are 8 and 128 for the coarse and fine matching networks, respectively.
4 Experiments
In this section, we first describe the test image dataset. We then carry out experiments and compare our approach with other methods in terms of matching accuracy, speed and robustness.
In the experiments, the test reference images are six aerial images. We apply a Gaussian blur kernel to each reference image and randomly select 100 small images as the blurred template images. The sizes of the reference images and template images are \(600 \times 600\) and \(50 \times 50\), respectively. The size of the Gaussian blur kernel is \((6\sigma +1) \times (6\sigma +1)\), where \(\sigma \) denotes the standard deviation and ranges from 1 to 5. Each test sample therefore consists of a reference image, a blurred template image and the corresponding matching position. To quantify the matching accuracy, we compute the Manhattan distance (MD) between the predicted matching position and the ground truth, and then report the percentage of test samples matched accurately under different MD thresholds.
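The accuracy measure can be sketched as below. The sketch assumes that a sample counts as accurately matched when its Manhattan distance is less than or equal to the threshold, which is one plausible reading of \(\mathbf P _{md\,=\,k}\); the function names are illustrative.

```python
def manhattan_distance(p, q):
    """Manhattan distance between two (x, y) positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def accuracy_at_md(predicted, ground_truth, threshold):
    """Percentage of samples whose Manhattan distance is <= threshold."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty position lists of equal length")
    hits = sum(manhattan_distance(p, g) <= threshold
               for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(predicted)
```

For predictions `[(10, 10), (12, 15), (40, 40)]` against ground truth `[(10, 10), (10, 12), (30, 30)]`, the distances are 0, 5 and 20, so two of the three samples are counted as matched at a threshold of 5.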
4.1 Matching Accuracy
We apply our method to blurred template matching and examine its performance. The compared methods are as follows: (1) NCC [11]: template matching based on the normalized correlation coefficient; (2) DNCC: image deblurring to recover the latent template image, followed by NCC; (3) JRM-DSR [5]: joint image deblurring and matching.
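For context, the NCC baseline can be sketched in a few lines of NumPy. This is a brute-force illustration that assumes the zero-mean variant of the normalized correlation coefficient and grayscale inputs; it is not the implementation used in the experiments, and the function names are illustrative.

```python
import numpy as np


def ncc(a, b):
    """Zero-mean normalized correlation of two equally sized arrays, in [-1, 1]."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant patch has zero variance; treat it as uncorrelated.
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def ncc_match(reference, template):
    """Return the top-left (row, col) of the patch with the highest NCC score."""
    th, tw = template.shape
    H, W = reference.shape
    scores = np.array([[ncc(reference[r:r + th, c:c + tw], template)
                        for c in range(W - tw + 1)]
                       for r in range(H - th + 1)])
    r, c = np.unravel_index(np.argmax(scores), scores.shape)
    return int(r), int(c)
```

Because the score is normalized, it is unaffected by a linear change in brightness or contrast, but nothing in it compensates for blur, which is consistent with the drop in NCC accuracy reported in this section.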
When the standard deviation of the Gaussian blur kernel is 3, the matching accuracy on the test dataset is listed in Table 3. The results show that the matching accuracy of DNCC is lower than that of NCC, which suggests that deblurring the template first does not help matching here. JRM-DSR performs better than NCC and DNCC because it combines image deblurring and matching. Our method has the highest matching accuracy. In detail, the matching accuracy \(\mathbf P _{md\,=\,5}\) of the other methods is 66.67%, 65.33% and 91.00%, respectively, whereas our method achieves 100.00% under the same conditions. In brief, our method significantly outperforms NCC, DNCC and JRM-DSR in terms of matching accuracy.
4.2 Matching Speed
We also examine the matching speed of the proposed method and compare it with that of the other methods. In the experiment, we measure the preparation time, computing time and total time of each method. The results are listed in Table 4. JRM-DSR takes a long time to construct the image dictionary, whereas the other methods do not require any preparation. The total time of NCC is the shortest, followed by that of our method, but the matching accuracy of NCC is low. The total times of DNCC and JRM-DSR are 4.52 and 1026.67 times that of our method, respectively. Therefore, our method achieves the highest matching accuracy while remaining fast.
4.3 Robustness Analysis
Influence of the Standard Deviation of Gaussian Blur Kernel. Image blurring has a great influence on the matching accuracy: the greater the standard deviation of the Gaussian blur kernel, the more difficult the matching. Therefore, experiments are carried out to show that our approach is robust to the standard deviation (\(\sigma \)) of the Gaussian blur kernel. In the experiments, \(\sigma \) ranges from 1 to 5, and the results are shown in Table 5. When \(\sigma \) changes from 1 to 5, the matching accuracy of NCC, DNCC and JRM-DSR decreases by 67.33%, 76.67% and 51.50%, respectively, whereas the matching accuracy of our method stays at 100%. Consequently, our method is more robust to the standard deviation of the Gaussian blur kernel.
Influence of Scale Variation. This experiment analyzes the robustness of our method to the template image size, and the matching accuracy of the different methods is shown in Fig. 4. The results show that our method achieves the highest matching accuracy regardless of the size of the template image, while the matching accuracy of NCC and JRM-DSR varies greatly. In particular, when the template image size changes, JRM-DSR takes a long time to construct the image dictionary or feature dictionary. In contrast, our method requires neither preparation nor retraining, and it can directly process different reference images and template images. These results indicate that our method is more robust to scale variation than the other methods.
5 Conclusions
In this paper, we have presented a blurred template matching method based on a cascaded network. Our method uses a coarse matching network to search for the small image that contains the target matching position, and then uses a fine matching network to determine the final matching position in the reference image. The experimental results and analysis demonstrate its effectiveness on blurred template matching, and our method significantly outperforms the state of the art in terms of matching accuracy, speed and robustness.
References
1. Ryan, M., Hanafiah, N.: An examination of character recognition on ID card using template matching approach. Proc. Comput. Sci. 59, 520–529 (2015)
2. Boia, R., Florea, C., Florea, L., et al.: Logo localization and recognition in natural images using homographic class graphs. Mach. Vis. Appl. 27(2), 287–301 (2016)
3. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74(1), 59–73 (2007)
4. Szeliski, R.: Image alignment and stitching: a tutorial. Found. Trends® Comput. Graph. Vis. 2(1), 1–104 (2007)
5. Shao, Y., Sang, N., Gao, C., et al.: Joint image restoration and matching based on distance-weighted sparse representation. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2498–2503. IEEE (2018)
6. Smith, S.M., Brady, J.M.: SUSAN - a new approach to low level image processing. Int. J. Comput. Vis. 23(1), 45–78 (1997)
7. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_34
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
9. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
10. Rublee, E., Rabaud, V., Konolige, K., et al.: ORB: an efficient alternative to SIFT or SURF. ICCV. 11(1), 2 (2011)
11. Aggarwal, J.K., Davis, L.S., Martin, W.N.: Correspondence processes in dynamic scene analysis. Proc. IEEE 69(5), 562–572 (1981)
12. Kaneko, S., Murase, I., Igarashi, S.: Robust image registration by increment sign correlation. Pattern Recognit. 35(10), 2223–2234 (2002)
13. Kaneko, S., Satoh, Y., Igarashi, S.: Using selective correlation coefficient for robust image registration. Pattern Recognit. 36(5), 1165–1173 (2003)
14. Yoo, J.C., Ahn, C.W.: Image matching using peak signal-to-noise ratio-based occlusion detection. IET Image Proc. 6(5), 483–495 (2012)
15. Dekel, T., Oron, S., Rubinstein, M., et al.: Best-buddies similarity for robust template matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2029 (2015)
16. Talmi, I., Mechrez, R., Zelnik-Manor, L.: Template matching with deformable diversity similarity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 175–183 (2017)
17. Kat, R., Jevnisek, R., Avidan, S.: Matching pixels using co-occurrence statistics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1751–1759 (2018)
18. Han, X., et al.: MatchNet: unifying feature and metric learning for patch-based matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
19. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
20. Xu, L., Zheng, S., Jia, J.: Unnatural L0 sparse representation for natural image deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1107–1114 (2013)
21. Pan, J., Sun, D., Pfister, H., et al.: Blind image deblurring using dark channel prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1628–1636 (2016)
22. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
24. Sandler, M., Howard, A., Zhu, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
25. He, K., Zhang, X., Ren, S., et al.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
26. Zhou, B., Lapedriza, A., Khosla, A., et al.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1452–1464 (2018)
27. Lin, T.Y., Goyal, P., Girshick, R., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Peng, J., Sang, N., Gao, C., Li, L. (2019). Blurred Template Matching Based on Cascaded Network. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science(), vol 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_39
DOI: https://doi.org/10.1007/978-3-030-34120-6_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34119-0
Online ISBN: 978-3-030-34120-6
eBook Packages: Computer Science, Computer Science (R0)