Abstract
Siamese-network-based trackers measure the similarity between a target template and a search region by computing their cross-correlation. Specifically, Siamese trackers treat the target template as a spatial filter that is convolved with the search region, which emphasizes the coarse-grained semantic abstraction of the target in the spatial domain. Despite the demonstrated success of Siamese trackers, little attention has been paid to fine-grained spatial details in the cross-correlation computation, although these details are crucial to precise target localization. In this paper, we propose to learn point-wise cross-correlation Siamese networks for visual tracking. By sketching the contour of the target, the proposed point-wise cross-correlation module helps Siamese networks become aware of the distinctive boundary between the target and the background. In conjunction with traditional depth-wise cross-correlation, the proposed Siamese network exploits both coarse-grained semantic abstraction and fine-grained details to locate the target precisely. Extensive experiments demonstrate the effectiveness and efficiency of the proposed tracker, which achieves new state-of-the-art results on five visual tracking benchmarks (VOT2016, VOT2018, VOT2019, OTB100, and LaSOT) at a speed of 38 FPS. As an extra benefit, our tracker can output a segmentation mask for the target. We demonstrate the favorable performance of our tracker on video object segmentation datasets in comparison with state-of-the-art methods.
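To make the two granularities concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The depth-wise function follows the description above: each template channel acts as a spatial filter slid over the matching search channel. The point-wise function is an assumption about what "point-wise" means here: each template location is taken as a 1x1 kernel across channels, giving one response map per template location. The abstract does not define the operation, so the paper's exact formulation may differ. The function names and the toy inputs are hypothetical.

```python
def depthwise_xcorr(template, search):
    """Depth-wise cross-correlation on nested lists shaped [C][H][W].

    Each template channel is slid over the matching search channel
    (no padding), giving an output shaped [C][Hx-Hz+1][Wx-Wz+1].
    """
    hz, wz = len(template[0]), len(template[0][0])
    hx, wx = len(search[0]), len(search[0][0])
    out = []
    for c in range(len(template)):
        response = []
        for i in range(hx - hz + 1):
            row = []
            for j in range(wx - wz + 1):
                row.append(sum(
                    template[c][u][v] * search[c][i + u][j + v]
                    for u in range(hz) for v in range(wz)
                ))
            response.append(row)
        out.append(response)
    return out


def pointwise_xcorr(template, search):
    """Point-wise cross-correlation (ASSUMED formulation, see lead-in).

    Every template location (u, v) contributes its C-dimensional feature
    vector as a 1x1 kernel, so the output is shaped [Hz*Wz][Hx][Wx] and
    keeps the full spatial resolution of the search region.
    """
    channels = len(template)
    hz, wz = len(template[0]), len(template[0][0])
    hx, wx = len(search[0]), len(search[0][0])
    out = []
    for u in range(hz):
        for v in range(wz):
            out.append([
                [sum(template[c][u][v] * search[c][i][j]
                     for c in range(channels))
                 for j in range(wx)]
                for i in range(hx)
            ])
    return out


# Toy single-channel features (hypothetical values, for illustration only).
template = [[[1, 0],
             [0, 1]]]
search = [[[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]]

coarse = depthwise_xcorr(template, search)  # one 2x2 map per channel
fine = pointwise_xcorr(template, search)    # four 3x3 maps, one per template point
```

In this sketch the depth-wise response shrinks spatially because the whole template window is aggregated at each position, whereas the point-wise response keeps one value per search location for every template point, which is one way to read the "fine-grained details" referred to above.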

Cite this article
Zhao, D., Ma, C., Zhu, D. et al. Learning bi-grained cross-correlation siamese networks for visual tracking. Appl Intell 52, 12175–12190 (2022). https://doi.org/10.1007/s10489-021-03015-9
DOI: https://doi.org/10.1007/s10489-021-03015-9