Siamese transformer RGBT tracking

Abstract

Siamese-based RGBT trackers have attracted wide attention because of their high efficiency. However, they lack an effective multimodal fusion module and sufficient information interaction between the search and template areas, which leads to poor performance. To address this problem, inspired by the global information modeling capability of the transformer, we construct a Siamese transformer RGBT tracker built around a single unified transformer module. Specifically, we propose a unified transformer fusion module that performs both feature extraction and global information interaction in the Siamese RGBT tracker, i.e., the interaction between the search and template areas and the interaction between the two modalities. It consists of self-attention and cross-attention, which handle feature extraction and information interaction, respectively. In addition, to alleviate the impact of multimodal fusion on the efficiency of template updating during tracking, we propose a feature-level template update strategy, which effectively improves tracking efficiency. To verify the effectiveness of our tracker, we evaluate it on five benchmark datasets, GTOT, RGBT210, RGBT234, LasHeR and VTUAV; the results show that our tracker achieves excellent performance compared with state-of-the-art methods.
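To make the two components described above concrete, here is a minimal, hypothetical PyTorch sketch. All module names, dimensions, and the moving-average update rule are illustrative assumptions, not the authors' implementation: a fusion block pairs self-attention (feature extraction within a token stream) with cross-attention (interaction between template and search streams, and analogously between RGB and thermal streams), and a feature-level template update blends cached template features instead of re-running the full extraction and fusion pipeline online.

```python
import torch
import torch.nn as nn

class UnifiedTransformerFusion(nn.Module):
    """Hypothetical fusion block: self-attention extracts features within a
    token stream; cross-attention lets that stream query another stream
    (template <-> search, or RGB <-> thermal) for global interaction."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Self-attention: feature extraction within the stream `x`.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: `x` queries the other stream, realizing the
        # search/template and cross-modal information interaction.
        x = x + self.cross_attn(self.norm2(x), context, context,
                                need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

def update_template(cached: torch.Tensor, fresh: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    """Feature-level template update (assumed form): blend cached template
    features with features from the latest confident prediction, so no
    extra extraction-and-fusion pass is needed during tracking."""
    return (1.0 - alpha) * cached + alpha * fresh

# Usage sketch: fuse RGB search tokens with template tokens.
block = UnifiedTransformerFusion()
search = torch.randn(1, 1024, 256)    # (batch, search tokens, channels)
template = torch.randn(1, 64, 256)    # (batch, template tokens, channels)
fused = block(search, template)
print(fused.shape)                    # torch.Size([1, 1024, 256])
```

Updating the template at the feature level, as sketched in `update_template`, is what lets the strategy sidestep repeated multimodal fusion at tracking time; the blending coefficient shown here is an assumed illustrative choice.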




Acknowledgements

This work is jointly supported by the Natural Science Foundation for the Higher Education Institutions of Anhui Province (No. KJ2021A0044), the Hefei Natural Science Foundation (No. HZ22ZK001), the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-038, GXXT-2022-042), and the National Natural Science Foundation of China (No. 62076003).

Author information


Corresponding author

Correspondence to Lei Liu.

Ethics declarations

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, F., Wang, W., Liu, L. et al. Siamese transformer RGBT tracking. Appl Intell 53, 24709–24723 (2023). https://doi.org/10.1007/s10489-023-04741-y

