Abstract
Template matching by normalized cross correlation (NCC) is widely used for finding image correspondences. NCCNet improves the robustness of this algorithm by transforming image features with siamese convolutional nets trained to maximize the contrast between NCC values of true and false matches. The main technical contribution is a weakly supervised learning algorithm for the training. Unlike fully supervised approaches to metric learning, the method can improve upon vanilla NCC without receiving locations of true matches during training. The improvement is quantified using patches of brain images from serial section electron microscopy. Relative to a parameter-tuned bandpass filter, siamese convolutional nets significantly reduce false matches. The improved accuracy of the method could be essential for connectomics, because emerging petascale datasets may require billions of template matches during assembly. Our method is also expected to generalize to other computer vision applications that use template matching to find image correspondences.
Acknowledgment
This work has been supported by an AWS Machine Learning Award and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.
A Appendix
A.1 Training Set
Our training set consists of 10,000 source-template image pairs from consecutive slices in an affine-aligned stack of images that contained non-affine deformations. We include only source-template pairs for which vanilla NCC yields \(r_{delta}<0.2\). This rejects over 90% of candidate pairs and increases the fraction of difficult examples, akin to hard example mining performed during data collection. During training we alternate between a batch of eight source-template pairs and the same batch with randomly permuted source-template pairings. We randomly crop the source and template images so that the position of the peak is randomly distributed, and we randomly rotate both inputs by \(90^{\circ }, 180^{\circ }, 270^{\circ }\) and randomly flip them.
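A minimal sketch of this sampling and augmentation loop is given below. It assumes two hypothetical helpers, `sample_pair` (draws a random source/template crop from consecutive slices) and `ncc_r_delta` (scores a pair with vanilla NCC); it illustrates the procedure described above, not the authors' actual code.

```python
import numpy as np

def augment(src, tmpl, rng):
    """Apply the same random 90-degree rotation and flip to both inputs."""
    k = int(rng.integers(0, 4))            # 0, 90, 180, or 270 degrees
    src, tmpl = np.rot90(src, k), np.rot90(tmpl, k)
    if rng.random() < 0.5:
        src, tmpl = np.fliplr(src), np.fliplr(tmpl)
    return src, tmpl

def make_batches(sample_pair, ncc_r_delta, rng, batch=8, r_delta_max=0.2):
    # `sample_pair` and `ncc_r_delta` are hypothetical helpers standing in
    # for the cropping and vanilla-NCC scoring described above.
    srcs, tmpls = [], []
    while len(srcs) < batch:
        src, tmpl = sample_pair(rng)
        if ncc_r_delta(src, tmpl) < r_delta_max:   # keep only hard examples
            src, tmpl = augment(src, tmpl, rng)
            srcs.append(src)
            tmpls.append(tmpl)
    matched = list(zip(srcs, tmpls))
    # Same batch with permuted pairings, used as the alternating non-match batch.
    perm = rng.permutation(batch)
    permuted = [(srcs[i], tmpls[j]) for i, j in enumerate(perm)]
    return matched, permuted
```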
A.2 Ground Truth Annotation
For the training data, the source image is \(512 \times 512\)px and each template is \(224 \times 224\)px. These pairs are sampled from 1040 superresolution slices, each 33,000 \(\times \) 33,000px. We validated our model on 95 serial images from the training set, an unpublished EM dataset with a resolution of \(7 \times 7 \times 40\,\mathrm{nm}^3\). Each image was 15,000 \(\times \) 15,000px and had been roughly aligned with an affine model, but still contained considerable non-affine distortions of up to 250px (full resolution).
Fig. 8. (Left) Matches on the raw images; (middle) matches on images filtered with a Gaussian bandpass filter; (right) NCCNet: matches on the convolutional-network-processed images. Displacement vector fields visualize the output of template matching on a regular triangular grid (edge length 400px at full resolution) in each slice. Each node is the centerpoint of a template image used in the template matching procedure, and each vector is the displacement of that template to its matching location in its source image. Matches shown use a 224px template on across (next-nearest-neighbor) sections. See also Figs. 10, 11 and 12.
Fig. 9. Manual inspection difficulties. (a) The vector field around a match (circled in red) that differs prominently from its neighbors. (b) The template for the match, showing many neurites parallel to the sectioning plane. (c) A false-color overlay of the template (green) over the source image (red) at the matched location, establishing the match as true.
In each experiment, both the template and the source images were downsampled by a factor of 3 before NCC, so that 160px and 224px templates were 480px and 672px at full resolution, while the downsampled source image was fixed at 512px (1,536px at full resolution). Template matches were taken on a triangular grid covering the image, with an edge length of 400px at full resolution (Fig. 8 shows the locations of template matches across an image).
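As an illustration of this setup (not the authors' pipeline), one template match could be computed as follows, using `skimage` for the downsampling and the NCC correlogram; `full_src` and `full_tmpl` are hypothetical full-resolution crops around one grid node.

```python
import numpy as np
from skimage.feature import match_template
from skimage.transform import downscale_local_mean

def match_one(full_src, full_tmpl, factor=3):
    src = downscale_local_mean(full_src, (factor, factor))    # e.g. 1536px -> 512px
    tmpl = downscale_local_mean(full_tmpl, (factor, factor))  # e.g. 672px -> 224px
    corr = match_template(src, tmpl)         # correlogram of NCC values
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Displacement at full resolution: offset of the NCC peak from the expected
    # (centered) template position, scaled back up by the downsampling factor.
    expected = ((src.shape[0] - tmpl.shape[0]) // 2,
                (src.shape[1] - tmpl.shape[1]) // 2)
    disp = (factor * (peak[0] - expected[0]), factor * (peak[1] - expected[1]))
    return corr, peak, disp
```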
Our first method to evaluate performance was to compare error rates. Errors were detected manually with a tool that allowed human annotators to inspect the template matching inputs and outputs. The tool is based on the visualization of the displacement vectors that result from each template match across a section, as shown in Fig. 8. Any match that differed significantly (by more than 50px) from its neighbors was rejected, and matches that differed from their neighbors less significantly were individually inspected for correctness by visualizing a false-color overlay of the template over the source at the match location. The latter step was needed because many true matches deviated prominently from their neighbors: the template patch could contain neurites or other features parallel to the sectioning plane, resulting in large motions of specific features in a direction that may not be consistent with the movement of the larger area around the template (see Fig. 9 for an example of this behavior). Table 1 summarizes the error counts in each experiment.
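The first-pass screening could be sketched as follows, assuming a hypothetical `neighbors` mapping from each grid node index to the indices of its adjacent nodes in the triangular mesh (mesh construction not shown).

```python
import numpy as np

def flag_outliers(displacements, neighbors, tol_px=50.0):
    """Flag matches whose displacement deviates from the local median by > tol_px."""
    displacements = np.asarray(displacements, dtype=float)   # shape (N, 2)
    flagged = []
    for i, nbrs in neighbors.items():
        local_median = np.median(displacements[list(nbrs)], axis=0)
        if np.linalg.norm(displacements[i] - local_median) > tol_px:
            flagged.append(i)   # queued for manual overlay inspection
    return flagged
```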
A.3 Match Filtering
To assess how easily false matches could be removed, we evaluated matches with the following criteria (a sketch of computing them follows the list):

- norm: the Euclidean norm of the displacement required to move the template image to its match location in the source image, at full resolution.
- \(r_{max}\): the height of the first peak of the correlogram, which serves as a proxy for confidence in the match.
- \(r_{delta}\): the difference between the first peak and the second peak of the correlogram (after removing a 5px square window surrounding the first peak), which estimates the certainty that there is no other likely match in the source image; this is the criterion the NCCNet was trained to optimize.
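As a concrete illustration (a sketch under the definitions above, not the authors' code), all three criteria can be read off the correlogram; `expected` is the hypothetical peak position corresponding to zero displacement, and `norm` should be scaled by the downsampling factor to report full-resolution pixels.

```python
import numpy as np

def match_criteria(corr, expected, win=5):
    """Compute (norm, r_max, r_delta) from an NCC correlogram `corr`."""
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    r_max = float(corr[peak])
    norm = float(np.hypot(peak[0] - expected[0], peak[1] - expected[1]))
    # Mask out a `win`-px square window around the first peak, then take the
    # remaining maximum as the second peak.
    half = win // 2
    masked = corr.copy()
    r0, c0 = peak
    masked[max(0, r0 - half):r0 + half + 1,
           max(0, c0 - half):c0 + half + 1] = -np.inf
    r_delta = r_max - float(masked.max())   # first peak minus second peak
    return norm, r_max, r_delta
```

Thresholding on the returned values, e.g. rejecting matches with \(r_{delta} < 0.05\), then implements the filtering analyzed below.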
These criteria can serve as useful heuristics to accept or reject matches: they approximate the unknown partition into true and erroneous correspondences. The less the actual distributions overlap when projected onto a criterion's dimension, the more useful that criterion. Figure 3 plots these three criteria across the three image conditions. The improvement in rejection efficiency also generalized well across experiments, as evident in Sup. Fig. 15. Achieving a 0.1% error rate on the most difficult task we tested (across sections, 160px template size) required rejecting 20% of the true matches with the bandpass filter, whereas rejecting less than 1% of the true matches sufficed with the convolutional nets.
False matches can be rejected efficiently with these match criteria. The NCCNet transformed the true match distributions for \(r_{max}\) and \(r_{delta}\) to be more left-skewed, while the erroneous match distribution for \(r_{delta}\) remained at lower values (see Fig. 3a), resulting in distributions more amenable to accurate error rejection. For adjacent sections with 224px templates, we can remove every error in our NCCNet output by rejecting matches with \(r_{delta}\) below 0.05, which sacrifices only 0.12% of the true matches. The same threshold also removes all false matches in the bandpass outputs, but sacrifices 0.40% of the true matches (see Fig. 3b). This 3.5x improvement in rejection efficiency is critical to balancing the trade-off between complete elimination of false matches and retaining as many true matches as possible.
The NCCNet produced matches in the vast majority of cases where the bandpass filter produced matches. It introduced some false matches that the bandpass did not, but it correctly identified 3-20 times as many additional true matches (see Table 3).
The majority of the false matches in the convnet output were also present in the bandpass output, which establishes the NCCNet as superior to, and not merely different from, the bandpass filter.
Whether a patch pair constitutes a true match depends on the tolerance for spatial localization error. Our method makes this tolerance explicit through the definition of the secondary peak. Explicitness is required by our weakly supervised approach, and it is preferable because blindly accepting a hidden tolerance parameter in the training set could produce suboptimal results in some applications (Table 2).