
SCATT: Transformer tracking with symmetric cross-attention

Published in: Applied Intelligence

Abstract

In popular Siamese network trackers, cross-correlation uses similarity to find the exact location of the template in the search region. However, because cross-correlation focuses primarily on spatial neighborhoods, it often falls into a local optimum. Additionally, fusing features multiple times degrades the target's position information. To address these issues, we propose a novel transformer-variant tracker. Cross-attention plays a central role in our tracker: we propose a novel symmetric cross-attention that effectively fuses the features of the template and the search region. The symmetric cross-attention relies solely on the cross-attention mechanism and thus dispenses with the cross-correlation operation, which avoids local optima and captures more global information. We also propose a position information enhancement module that preserves more horizontal and vertical position information, avoiding the loss of position information caused by multiple feature fusions and helping the tracker locate the target more accurately. Our proposed tracker achieves state-of-the-art performance on six benchmarks, including GOT-10k, TrackingNet, LaSOT, UAV123, OTB100, and VOT2020, and runs at real-time speed.
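As a rough illustration of the idea (not the authors' implementation: learned projection matrices, multi-head structure, and the position enhancement module are omitted), symmetric cross-attention can be sketched as scaled dot-product attention applied in both directions, so that template and search-region features are fused without any cross-correlation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: queries come from one branch,
    # keys/values from the other (linear projections omitted for brevity).
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def symmetric_cross_attention(template, search):
    d = template.shape[-1]
    # Template tokens attend over the whole search region, and vice
    # versa, so each branch aggregates global context from the other.
    fused_template = cross_attention(template, search, d)
    fused_search = cross_attention(search, template, d)
    return fused_template, fused_search

rng = np.random.default_rng(0)
template = rng.standard_normal((16, 64))   # 16 template tokens, dim 64
search = rng.standard_normal((100, 64))    # 100 search-region tokens
ft, fs = symmetric_cross_attention(template, search)
print(ft.shape, fs.shape)  # (16, 64) (100, 64)
```

Because every query attends over all tokens of the other branch, the fusion is global by construction, in contrast to cross-correlation, which compares only spatially neighboring windows.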


[Figures 1–9 are available in the full article.]


Data availability statement

Data will be made available on request.


Funding

This work was supported by the National Natural Science Foundation of China under Grant 61972056, the Open Fund of Key Laboratory of Safety Control of Bridge Engineering, Ministry of Education (Changsha University of Science and Technology) under Grant 21KB06, the Open Research Project of the State Key Laboratory of Industrial Control Technology under Grant ICT2022B60, and the Postgraduate Scientific Research Innovation Fund of Changsha University of Science and Technology under Grant CSLGCX23093.

Author information


Corresponding author

Correspondence to Jianming Zhang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, J., Chen, W., Dai, J. et al. SCATT: Transformer tracking with symmetric cross-attention. Appl Intell 54, 6069–6084 (2024). https://doi.org/10.1007/s10489-024-05467-1
