DOI: 10.1145/3595916.3626446

GTTrack: Gaussian Transformer Tracker for Visual Tracking

Published: 01 January 2024

ABSTRACT

Recently, Transformer-based visual object tracking methods have achieved impressive advances and significantly improved tracking performance. In these methods, the Transformer comprises two modules: self-attention and cross-attention. This design raises two problems. First, self-attention considers only the relative relations between elements when establishing global associations, so it cannot highlight the essential regions of the tracked target. Second, cross-attention relies solely on feature similarity to locate the target, so interference from similar objects remains a challenge. In this paper, we propose GTTrack, a new Transformer tracking method built on a Gaussian Attention (GA) mechanism and an Adaptive Focusing Module (AFM). The GA introduces a Gaussian prior to generate a semantic template with robust object features, where the prior directs more attention to the central region of the tracked target. The AFM computes the similarity between the current frame and the template by combining appearance features with position features. The position features are defined by an adaptive Gaussian prior derived from the target's area in the previous frame, and their introduction sharpens the contrast between the tracked target and similar distractors. Extensive experiments demonstrate that GTTrack outperforms many state-of-the-art trackers and achieves leading performance. Code will be available.
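The abstract describes both mechanisms only at a high level; the exact formulations appear in the full paper. As an illustration only, the NumPy sketch below shows the two ideas as stated: attention whose logits are biased by a center-peaked Gaussian prior (GA), and a matching score that fuses appearance similarity with an adaptive positional prior built from the previous frame's target box (AFM). All names (`gaussian_prior`, `gaussian_attention`, `afm_score`), the log-additive bias, and the multiplicative fusion are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def gaussian_prior(h, w, cy, cx, sigma_y, sigma_x):
    """2-D Gaussian map over an h x w grid, peaked at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 / (2 * sigma_y ** 2)
                    + (xs - cx) ** 2 / (2 * sigma_x ** 2)))

def gaussian_attention(q, k, v, prior):
    """Single-head attention with logits biased by a Gaussian prior (assumed fusion).

    q, k, v : (N, d) token features for a flattened h*w feature map.
    prior   : (N,) flattened Gaussian map over the key positions.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(prior + 1e-6)  # bias toward the center
    logits -= logits.max(axis=-1, keepdims=True)          # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def afm_score(search_feats, template_vec, prev_box, feat_hw):
    """Fuse appearance similarity with an adaptive positional prior (assumed design).

    search_feats : (N, d) search-region tokens, N = h*w.
    template_vec : (d,) pooled template feature.
    prev_box     : (cy, cx, bh, bw) previous-frame target box in feature-map coords.
    """
    h, w = feat_hw
    cy, cx, bh, bw = prev_box
    # Adaptive Gaussian prior: sigma scales with the previous target size.
    pos = gaussian_prior(h, w, cy, cx, bh / 2.0, bw / 2.0).reshape(-1)
    app = search_feats @ template_vec / (
        np.linalg.norm(search_feats, axis=-1) * np.linalg.norm(template_vec) + 1e-6)
    # Multiplicative fusion suppresses look-alikes far from the previous location.
    return app * pos

# Toy usage on random features (16 x 16 search region, 64-dim tokens).
rng = np.random.default_rng(0)
h, w, d = 16, 16, 64
tokens = rng.standard_normal((h * w, d))
prior = gaussian_prior(h, w, cy=8, cx=8, sigma_y=3, sigma_x=3).reshape(-1)
template = gaussian_attention(tokens, tokens, tokens, prior).mean(axis=0)
scores = afm_score(tokens, template, prev_box=(8, 8, 6, 6), feat_hw=(h, w))
print(scores.argmax())  # index of the best-matching search position
```

In this sketch the positional prior enters the attention as a log-additive bias and the AFM score multiplicatively; the paper may combine appearance and position features differently.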


Published in

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023, 745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916
Copyright © 2023 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
