
Deep Triply Attention Network for RGBT Tracking

Cognitive Computation

Abstract

RGB-Thermal (RGBT) tracking has gained significant attention in computer vision due to its wide range of applications in video surveillance, autonomous driving, and human-computer interaction. This paper focuses on achieving a robust fusion of different modalities for RGBT tracking through attention modeling. We propose an effective triply attentive network for robust RGBT tracking, consisting of a local attention module, a cross-modality co-attention module, and a global attention module. The local attention module, whose attention maps are generated by backpropagating the score map with respect to the RGB and thermal image pair, enables the tracker to focus on target regions while accounting for background interference. To enhance the interaction of the two modalities during feature learning, we introduce a co-attention module that simultaneously selects more discriminative features for both the visible (RGB) and thermal modalities. To compensate for the limitations of local sampling, we incorporate a global attention module that exploits multi-modal information to compute high-quality global proposals; this module not only complements the local search strategy but also re-tracks lost targets when they reappear in view. Extensive experiments on three RGBT tracking datasets demonstrate that our method outperforms other RGBT trackers. Specifically, on the LasHeR dataset, the precision rate, normalized precision rate, and success rate reach 57.5%, 51.6%, and 41.0%, respectively. These state-of-the-art results confirm the effectiveness of our method in exploiting the complementary advantages of the two modalities to achieve robust visual tracking.
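As a concrete illustration of the first two attention mechanisms, the following is a minimal PyTorch sketch, not the authors' implementation: the backbone, layer sizes, and names (`TinyBackbone`, `local_attention`, `CrossModalCoAttention`) are hypothetical stand-ins. It shows (i) a local attention map obtained by backpropagating the target score with respect to the RGB and thermal inputs, and (ii) a channel-level co-attention that reweights each modality's features conditioned on both modalities, in the spirit of squeeze-and-excitation.

```python
# Minimal sketch of the attention ideas described in the abstract.
# NOT the authors' code: backbone, layer sizes, and names are hypothetical.
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Stand-in CNN mapping an image crop to (background, target) scores."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.score = nn.Linear(32, 2)

    def forward(self, x):
        return self.score(self.features(x).flatten(1))


def local_attention(net_rgb, net_t, rgb, thermal):
    """Gradient-based local attention: backpropagate the target score with
    respect to both input images; channel-pooled absolute gradients serve as
    attention maps highlighting the regions the score depends on."""
    rgb = rgb.clone().requires_grad_(True)
    thermal = thermal.clone().requires_grad_(True)
    target_score = net_rgb(rgb)[:, 1].sum() + net_t(thermal)[:, 1].sum()
    target_score.backward()

    def to_map(grad):  # B x 3 x H x W gradients -> normalized B x 1 x H x W
        a = grad.abs().mean(dim=1, keepdim=True)
        a = a - a.amin(dim=(2, 3), keepdim=True)
        return a / (a.amax(dim=(2, 3), keepdim=True) + 1e-8)

    return to_map(rgb.grad), to_map(thermal.grad)


class CrossModalCoAttention(nn.Module):
    """Channel co-attention sketch: channel weights for BOTH modalities are
    predicted jointly from their concatenated global descriptors, so each
    modality's reweighting is conditioned on the other."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_t):
        g = torch.cat([f_rgb.mean(dim=(2, 3)), f_t.mean(dim=(2, 3))], dim=1)
        w = self.fc(g)
        c = f_rgb.shape[1]
        w_rgb, w_t = w[:, :c, None, None], w[:, c:, None, None]
        return f_rgb * w_rgb, f_t * w_t


if __name__ == "__main__":
    rgb = torch.rand(2, 3, 107, 107)      # MDNet-style crop size (assumed)
    thermal = torch.rand(2, 3, 107, 107)  # thermal replicated to 3 channels
    a_rgb, a_t = local_attention(TinyBackbone(), TinyBackbone(), rgb, thermal)
    co = CrossModalCoAttention(channels=32)
    f_rgb, f_t = co(torch.rand(2, 32, 25, 25), torch.rand(2, 32, 25, 25))
    print(a_rgb.shape, a_t.shape, f_rgb.shape, f_t.shape)
```

The global attention module is not sketched; per the abstract, it uses multi-modal information to generate whole-image proposals that complement local sampling and enable re-detection of lost targets.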


Data Availability

Data is available on request from the authors.


Funding

This work was supported by the Major Project for New Generation of AI (Grant No. 2018AAA0100400) and the National Natural Science Foundation of China (Grant Nos. 62202002 and 62102205).

Author information


Corresponding author

Correspondence to Yabin Zhu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Conflict of Interest

Rui Yang is a Master's graduate of Anhui University and is currently employed at Arm China. Xiao Wang previously served as a postdoctoral fellow at Pengcheng Laboratory and is presently a faculty member at Anhui University. Yabin Zhu is currently a postdoctoral fellow at Anhui University. Jin Tang is a professor at Anhui University. Beyond these affiliations, the authors declare no conflicts of interest with external entities.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, R., Wang, X., Zhu, Y. et al. Deep Triply Attention Network for RGBT Tracking. Cogn Comput 15, 1934–1946 (2023). https://doi.org/10.1007/s12559-023-10158-z
