Abstract
In popular Siamese network trackers, cross-correlation measures the similarity between the template and the search region to locate the target. However, because cross-correlation focuses primarily on spatial neighborhoods, it often falls into local optima. In addition, repeated feature fusion degrades the target's position information. To address these issues, we propose a novel transformer-variant tracker in which cross-attention plays a central role, realized as a novel symmetric cross-attention that effectively fuses the features of the template and the search region. The symmetric cross-attention relies solely on the cross-attention mechanism and dispenses with the cross-correlation operation, which avoids local optima and captures more global information. We also propose a position information enhancement module that preserves more horizontal and vertical position information, avoiding the loss of position information caused by repeated feature fusion and helping the tracker locate the target more accurately. Our proposed tracker achieves state-of-the-art performance on six benchmarks, including GOT-10k, TrackingNet, LaSOT, UAV123, OTB100, and VOT2020, and runs at real-time speed.
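To make the idea concrete, the mirrored fusion described above can be sketched as two cross-attention passes: search-region tokens attend to template tokens, and template tokens attend to search-region tokens. This is only an illustrative NumPy sketch under our own assumptions (single head, no learned projections or normalization; the function names `cross_attention` and `symmetric_cross_attention` are ours), not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat):
    # q_feat: (Nq, d) queries; kv_feat: (Nkv, d) keys/values.
    # Single-head scaled dot-product attention, no learned projections.
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)        # (Nq, Nkv)
    return softmax(scores, axis=-1) @ kv_feat        # (Nq, d)

def symmetric_cross_attention(template, search):
    # Two mirrored cross-attention passes with residual connections:
    # every search token sees every template token and vice versa,
    # so the fusion is global rather than limited to a spatial window.
    fused_search = search + cross_attention(search, template)
    fused_template = template + cross_attention(template, search)
    return fused_template, fused_search

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 32))   # template tokens (Nz, d)
x = rng.standard_normal((64, 32))   # search-region tokens (Nx, d)
fz, fx = symmetric_cross_attention(z, x)
print(fz.shape, fx.shape)           # shapes are preserved: (16, 32) (64, 32)
```

Because each output token is a softmax-weighted sum over all tokens of the other branch, no sliding correlation window is involved, which is the sense in which such a design can avoid the local optima of cross-correlation.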









Data availability statements
Data will be made available on request.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 61972056, the Open Fund of Key Laboratory of Safety Control of Bridge Engineering, Ministry of Education (Changsha University of Science and Technology) under Grant 21KB06, the Open Research Project of the State Key Laboratory of Industrial Control Technology under Grant ICT2022B60, and the Postgraduate Scientific Research Innovation Fund of Changsha University of Science and Technology under Grant CSLGCX23093.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Chen, W., Dai, J. et al. SCATT: Transformer tracking with symmetric cross-attention. Appl Intell 54, 6069–6084 (2024). https://doi.org/10.1007/s10489-024-05467-1