Abstract
Siamese-based RGBT trackers have attracted wide attention because of their high efficiency. However, they lack an effective multimodal fusion module and sufficient information interaction between the search and template regions, which limits their performance. To address this problem, and inspired by the global information modeling capability of the transformer, we construct a Siamese transformer RGBT tracker built around a single unified transformer module. Specifically, we propose a unified transformer fusion module that performs both feature extraction and global information interaction in the Siamese RGBT tracker, namely the interaction between the search and template regions and the interaction between different modalities. The module consists of self-attention and cross-attention, which handle feature extraction and cross-branch information interaction, respectively. In addition, to alleviate the impact of multimodal fusion on the efficiency of template update during tracking, we propose a feature-level template update strategy, which effectively improves tracking efficiency. To verify the effectiveness of our tracker, we evaluate it on five benchmark datasets, including GTOT, RGBT210, RGBT234, LasHeR, and VTUAV; the results show that our tracker achieves excellent performance compared with state-of-the-art methods.
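The abstract only sketches the architecture, so the snippet below is a minimal PyTorch sketch of how a unified attention block of this kind might look: self-attention mixes the concatenated template and search tokens within one modality, and cross-attention exchanges information between the RGB and thermal streams. All names (UnifiedFusionBlock, tokens_rgb, tokens_tir) and design details (shared weights across modalities, pre-norm residuals, feed-forward width) are illustrative assumptions, not the authors' implementation, and the feature-level template update strategy is not modeled here.

```python
import torch
import torch.nn as nn


class UnifiedFusionBlock(nn.Module):
    """Hypothetical block combining intra-modal self-attention with
    inter-modal cross-attention; weights are shared across modalities
    purely to keep the sketch short."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens_rgb, tokens_tir):
        # tokens_*: (B, N, C); N spans the template tokens plus search tokens
        # of one modality, so self-attention already couples template and search.
        def self_block(x):
            y = self.norm1(x)
            return x + self.self_attn(y, y, y, need_weights=False)[0]

        tokens_rgb = self_block(tokens_rgb)
        tokens_tir = self_block(tokens_tir)

        # Cross-attention: each modality queries the other to exchange
        # complementary RGB / thermal information.
        q_rgb, q_tir = self.norm2(tokens_rgb), self.norm2(tokens_tir)
        tokens_rgb = tokens_rgb + self.cross_attn(q_rgb, q_tir, q_tir, need_weights=False)[0]
        tokens_tir = tokens_tir + self.cross_attn(q_tir, q_rgb, q_rgb, need_weights=False)[0]

        # Feed-forward refinement with residual connections.
        tokens_rgb = tokens_rgb + self.ffn(self.norm3(tokens_rgb))
        tokens_tir = tokens_tir + self.ffn(self.norm3(tokens_tir))
        return tokens_rgb, tokens_tir


if __name__ == "__main__":
    block = UnifiedFusionBlock(dim=256, num_heads=8)
    # e.g. an 8x8 template grid plus a 16x16 search grid of tokens per modality
    rgb = torch.randn(2, 64 + 256, 256)
    tir = torch.randn(2, 64 + 256, 256)
    rgb, tir = block(rgb, tir)
    print(rgb.shape, tir.shape)  # torch.Size([2, 320, 256]) for each stream
```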
Acknowledgements
This work is jointly supported by the Natural Science Foundation for the Higher Education Institutions of Anhui Province (No. KJ2021A0044), Hefei Natural Science Foundation (No. HZ22ZK001), the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-038, GXXT-2022-042), and the National Natural Science Foundation of China (No. 62076003).
Ethics declarations
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, F., Wang, W., Liu, L. et al. Siamese transformer RGBT tracking. Appl Intell 53, 24709–24723 (2023). https://doi.org/10.1007/s10489-023-04741-y
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-023-04741-y