
Learning bi-grained cross-correlation siamese networks for visual tracking

Published in Applied Intelligence

Abstract

Siamese network based trackers measure the similarity between a target template and a search region by computing their cross-correlation. Specifically, Siamese trackers regard the target template as a spatial filter that convolves the search region, emphasizing the coarse-grained semantic abstraction of the target in the spatial domain. Despite the demonstrated success of Siamese trackers, little attention has been paid to fine-grained spatial details in the cross-correlation computation, which are crucial for precise target localization. In this paper, we propose to learn point-wise cross-correlation Siamese networks for visual tracking. By sketching the contour of the target, the proposed point-wise cross-correlation module helps Siamese networks become aware of the distinctive boundary between the target and the background. In conjunction with traditional depth-wise cross-correlation, the proposed Siamese network exploits both coarse-grained semantic abstraction and fine-grained detail to precisely locate the target. Extensive experiments demonstrate the effectiveness and efficiency of the proposed tracker, which achieves new state-of-the-art results on five visual tracking benchmarks, VOT2016, VOT2018, VOT2019, OTB100, and LaSOT, at a speed of 38 FPS. As an extra benefit, our tracker can output a segmentation mask for the target. We demonstrate the favorable performance of our tracker on video object segmentation datasets in comparison with the state of the art.
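The two correlation granularities contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation, only an assumed formulation of the two standard operations: depth-wise cross-correlation slides each template channel over the matching search-region channel, while point-wise (pixel-wise) cross-correlation treats every spatial position of the template as a 1x1 filter, preserving fine spatial detail. The function names and tensor shapes are illustrative assumptions.

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Depth-wise cross-correlation: each channel of the template z
    (C, Hz, Wz) slides over the same channel of the search region x
    (C, Hx, Wx), giving a coarse (C, Hx-Hz+1, Wx-Wz+1) response."""
    C, Hz, Wz = z.shape
    _, Hx, Wx = x.shape
    out = np.zeros((C, Hx - Hz + 1, Wx - Wz + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # inner product of the template channel with a window
                out[c, i, j] = np.sum(z[c] * x[c, i:i + Hz, j:j + Wz])
    return out

def pointwise_xcorr(z, x):
    """Point-wise cross-correlation: every spatial position of z acts
    as a 1x1 filter, so each search location is compared against each
    template point's full feature vector, giving a fine-grained
    (Hz*Wz, Hx, Wx) response."""
    C, Hz, Wz = z.shape
    _, Hx, Wx = x.shape
    kernels = z.reshape(C, Hz * Wz)   # (C, P): P template points
    feats = x.reshape(C, Hx * Wx)     # (C, N): N search points
    return (kernels.T @ feats).reshape(Hz * Wz, Hx, Wx)
```

Note the trade-off the paper exploits: the depth-wise response pools a whole template window into one score per channel (semantic abstraction), while the point-wise response keeps one map per template point, which is what allows boundary-level detail to survive into the matching stage.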




Author information


Corresponding author

Correspondence to Dandan Zhu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, D., Ma, C., Zhu, D. et al. Learning bi-grained cross-correlation siamese networks for visual tracking. Appl Intell 52, 12175–12190 (2022). https://doi.org/10.1007/s10489-021-03015-9

