
Siamese visual tracking based on criss-cross attention and improved head network

Published in: Multimedia Tools and Applications

Abstract

Efficient Siamese anchor-free trackers have few parameters, but they produce many low-quality bounding boxes located far from the center of the object. Moreover, background clutter and distractors interfere with the tracking process, leading to inaccurate classification and regression results. We therefore propose a novel Siamese anchor-free network based on criss-cross attention and an improved head network. We apply ResNet-50 to extract features from the template image and the search region, then feed the feature maps into a recurrent criss-cross attention module to make them more discriminative. The enhanced feature maps are passed to our improved head network, which adds a center-ness branch to the original classification and regression branches to filter out low-quality bounding boxes. The proposed tracker reduces the impact of background clutter and distractors and obtains high-quality bounding boxes, yielding more accurate and robust tracking results. Extensive experiments and comparisons with state-of-the-art trackers on challenging benchmarks including VOT2016, VOT2018, GOT-10k, UAV123 and OTB2015 show that our tracker achieves excellent performance at real-time speed.
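The core idea of criss-cross attention is that each spatial position attends only over the H + W − 1 positions in its own row and column, rather than all H × W positions as in full non-local attention; applying the module twice (the recurrent step) lets information propagate to every position. The NumPy sketch below illustrates this idea only; it is not the authors' implementation, and the per-position loop and the omission of the learned query/key/value projection convolutions are deliberate simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def criss_cross_attention(q, k, v):
    """Simplified criss-cross attention.

    For each position (i, j), attention is computed only over the
    H + W - 1 positions in row i and column j (the "criss-cross" path).
    q, k: (H, W, Ck) query/key maps; v: (H, W, C) value map.
    Returns an aggregated map of shape (H, W, C).
    """
    H, W, _ = q.shape
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            # Keys/values along column j (includes (i, j) once) and
            # along row i with position j removed to avoid counting
            # (i, j) twice.
            keys = np.concatenate([k[:, j], np.delete(k[i], j, axis=0)])
            vals = np.concatenate([v[:, j], np.delete(v[i], j, axis=0)])
            attn = softmax(keys @ q[i, j])   # (H + W - 1,) weights
            out[i, j] = attn @ vals          # weighted sum of values
    return out
```

Calling the function twice on its own output mimics the recurrent criss-cross module: after the second pass, every position has (indirectly) aggregated information from the full feature map.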


Data availability

The VOT2016 and VOT2018 datasets analyzed during the current study are available at https://www.votchallenge.net/. The UAV123 dataset is available at https://cemse.kaust.edu.sa/ivul/uav123, the GOT-10k dataset at http://got-10k.aitestunion.com/, and the OTB2015 dataset at http://cvlab.hanyang.ac.kr/tracker_benchmark/.


Acknowledgements

This work was supported in part by the Open Fund of Key Laboratory of Safety Control of Bridge Engineering, Ministry of Education (Changsha University of Science and Technology) under Grant 21 KB06, in part by the Science Fund for Creative Research Groups of Hunan Province under Grant 2020JJ1006, and in part by the National Natural Science Foundation of China under Grant 61972056.

Author information

Corresponding author

Correspondence to Jianming Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, J., Huang, H., Jin, X. et al. Siamese visual tracking based on criss-cross attention and improved head network. Multimed Tools Appl 83, 1589–1615 (2024). https://doi.org/10.1007/s11042-023-15429-3

