GTTrack: Gaussian Transformer Tracker for Visual Tracking

ABSTRACT
Recently, Transformer-based visual object tracking methods have achieved impressive advances and significantly improved tracking performance. In these methods, the Transformer comprises two modules: self-attention and cross-attention. However, this design raises two problems. First, self-attention considers only the pairwise relations between elements when establishing global associations, and thus cannot highlight the essential regions of the tracked target. Second, cross-attention relies solely on feature similarity to locate the target, so interference from similar objects remains a challenge. In this paper, we propose a new Transformer tracking method, GTTrack, built on a Gaussian Attention (GA) and an Adaptive Focusing Module (AFM). The GA introduces a Gaussian prior to generate a semantic template with robust object features, where the prior concentrates attention on the central region of the tracked target. The AFM computes the similarity between the current frame and the template by combining appearance features with position features. The position features are defined by an adaptive Gaussian prior derived from the target area in the previous frame; introducing them enhances the contrast between the tracked target and similar distractors. Extensive experiments demonstrate that GTTrack outperforms many state-of-the-art trackers and achieves leading performance. Code will be available.
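The idea of biasing attention with a Gaussian prior centered on the target can be illustrated with a minimal sketch. This is our own simplified illustration, not the paper's exact formulation: the function names, the choice of an isotropic Gaussian, and the log-space additive bias are all assumptions.

```python
import numpy as np

def gaussian_prior(h, w, center, sigma):
    """2D Gaussian map over an h x w feature grid, peaked at `center` = (y, x)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def gaussian_attention(q, k, h, w, center, sigma):
    """Dot-product attention whose logits are biased toward a Gaussian prior.

    q: (n, d) queries, k: (n, d) keys, with n = h * w grid positions.
    The prior is added in log-space, so keys near the target center
    receive systematically more attention than peripheral ones.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])              # (n, n) similarity
    prior = gaussian_prior(h, w, center, sigma).ravel()  # (n,) spatial bias
    logits = logits + np.log(prior + 1e-6)               # bias toward center
    logits -= logits.max(axis=-1, keepdims=True)         # numerically stable softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

An adaptive variant, as the abstract describes for the AFM, would re-center the prior (and could rescale `sigma`) from the target box predicted in the previous frame, so the position bias tracks the target rather than staying fixed.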