Abstract
In popular Siamese network trackers, cross-correlation measures the similarity between the template and the search region to locate the target. However, because cross-correlation focuses primarily on spatial neighborhoods, it often falls into local optima. In addition, repeated feature fusion degrades the target's position information. To address these issues, we propose a novel transformer-variant tracker in which cross-attention plays a central role, realized as a novel symmetric cross-attention that effectively fuses the features of the template and the search region. The symmetric cross-attention relies solely on the cross-attention mechanism and dispenses with the cross-correlation operation, which avoids local optima and captures more global information. We also propose a position information enhancement module that preserves more horizontal and vertical position information, avoiding the loss of position information caused by repeated feature fusion and helping the tracker locate the target more accurately. Our proposed tracker achieves state-of-the-art performance on six benchmarks, including GOT-10k, TrackingNet, LaSOT, UAV123, OTB100, and VOT2020, and runs at real-time speed.
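To make the idea concrete, the mirrored fusion described above can be sketched as two cross-attention passes: search-region tokens attend to template tokens, and template tokens attend to search-region tokens. This is only an illustrative NumPy sketch under our own assumptions (single head, no learned projections or normalization; the function names `cross_attention` and `symmetric_cross_attention` are ours), not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat):
    # q_feat: (Nq, d) queries; kv_feat: (Nkv, d) keys/values.
    # Single-head scaled dot-product attention, no learned projections.
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)        # (Nq, Nkv)
    return softmax(scores, axis=-1) @ kv_feat        # (Nq, d)

def symmetric_cross_attention(template, search):
    # Two mirrored cross-attention passes with residual connections:
    # every search token sees every template token and vice versa,
    # so the fusion is global rather than limited to a spatial window.
    fused_search = search + cross_attention(search, template)
    fused_template = template + cross_attention(template, search)
    return fused_template, fused_search

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 32))   # template tokens (Nz, d)
x = rng.standard_normal((64, 32))   # search-region tokens (Nx, d)
fz, fx = symmetric_cross_attention(z, x)
print(fz.shape, fx.shape)           # shapes are preserved: (16, 32) (64, 32)
```

Because each output token is a softmax-weighted sum over all tokens of the other branch, no sliding correlation window is involved, which is the sense in which such a design can avoid the local optima of cross-correlation.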









Data availability statements
Data will be made available on request.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 61972056, the Open Fund of Key Laboratory of Safety Control of Bridge Engineering, Ministry of Education (Changsha University of Science and Technology) under Grant 21KB06, the Open Research Project of the State Key Laboratory of Industrial Control Technology under Grant ICT2022B60, and the Postgraduate Scientific Research Innovation Fund of Changsha University of Science and Technology under Grant CSLGCX23093.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Chen, W., Dai, J. et al. SCATT: Transformer tracking with symmetric cross-attention. Appl Intell 54, 6069–6084 (2024). https://doi.org/10.1007/s10489-024-05467-1