Robust Tracking via Unifying Pretrain-Finetuning and Visual Prompt Tuning

Authors: Guangtong Zhang, Bineng Zhong

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

Article No.: 38, Pages 1 - 7

https://doi.org/10.1145/3595916.3626410

Published: 01 January 2024

Abstract

The finetuning paradigm is widely used for the supervised training of top-performing trackers. However, it faces one key issue: it is unclear how best to finetune a pretrained model for tracking tasks while alleviating catastrophic forgetting. To address this problem, we propose a novel partial finetuning paradigm for visual tracking that unifies pretrain-finetuning and visual prompt tuning (named UPVPT). It can not only efficiently learn knowledge from the tracking task but also reuse the prior knowledge learned by the pretrained model to handle the various challenges of tracking effectively. First, to maintain the pretrained prior knowledge, we design a prompt-style method that freezes some parameters of the pretrained network. Then, to learn knowledge from the tracking task, we update the parameters of the prompt and MLP layers. As a result, we can not only retain useful prior knowledge of the pretrained model by freezing the backbone network but also effectively learn target-domain knowledge by updating the prompt and MLP layers. Furthermore, the proposed UPVPT can easily be embedded into existing Transformer trackers (e.g., OSTrack and SwinTrack) by adding only a small number of model parameters (less than 1% of the backbone network's parameters). Extensive experiments on five tracking benchmarks (i.e., UAV123, GOT-10k, LaSOT, TNL2K, and TrackingNet) demonstrate that the proposed UPVPT improves the robustness and effectiveness of the model, especially in complex scenarios.
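The freeze-and-update split described in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' implementation: the ViT-Base-like sizes (12 blocks, width 768), the 8 prompt tokens per block, the parameter names, and the name-based `is_trainable` rule are all assumptions chosen for illustration, and patch and position embeddings are left out for brevity. The sketch only counts parameters to show how adding per-layer prompts can stay well under 1% of a backbone of this size.

```python
# Hypothetical sizes for a ViT-Base-like backbone; none of these values
# are taken from the paper.
DEPTH, DIM, NUM_PROMPTS = 12, 768, 8


def backbone_param_counts(depth=DEPTH, dim=DIM):
    """Map parameter name -> element count for a plain Transformer encoder.

    Patch and position embeddings are omitted to keep the sketch short.
    """
    counts = {}
    for i in range(depth):
        p = f"blocks.{i}"
        counts[f"{p}.norm1.weight"] = dim
        counts[f"{p}.norm1.bias"] = dim
        counts[f"{p}.attn.qkv.weight"] = 3 * dim * dim
        counts[f"{p}.attn.qkv.bias"] = 3 * dim
        counts[f"{p}.attn.proj.weight"] = dim * dim
        counts[f"{p}.attn.proj.bias"] = dim
        counts[f"{p}.norm2.weight"] = dim
        counts[f"{p}.norm2.bias"] = dim
        counts[f"{p}.mlp.fc1.weight"] = 4 * dim * dim
        counts[f"{p}.mlp.fc1.bias"] = 4 * dim
        counts[f"{p}.mlp.fc2.weight"] = 4 * dim * dim
        counts[f"{p}.mlp.fc2.bias"] = dim
    return counts


def prompt_param_counts(depth=DEPTH, dim=DIM, num_prompts=NUM_PROMPTS):
    """One set of learnable prompt tokens per block (an assumed layout)."""
    return {f"prompts.{i}": num_prompts * dim for i in range(depth)}


def is_trainable(name):
    """Assumed rule: update prompts and MLP layers, freeze everything else."""
    return name.startswith("prompts.") or ".mlp." in name


def summarize():
    backbone = backbone_param_counts()
    prompts = prompt_param_counts()
    everything = {**backbone, **prompts}
    trainable = sum(n for name, n in everything.items() if is_trainable(name))
    frozen = sum(n for name, n in everything.items() if not is_trainable(name))
    return {
        "backbone": sum(backbone.values()),
        "added": sum(prompts.values()),
        "trainable": trainable,
        "frozen": frozen,
        "added_fraction": sum(prompts.values()) / sum(backbone.values()),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

In a real training setup the same name-based rule would typically decide which parameters receive gradients; here it only partitions the counts, so the numbers depend entirely on the assumed sizes above.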

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Tracking

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

December 2023

745 pages

ISBN: 9798400702051

DOI: 10.1145/3595916

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MMAsia '23

Sponsor:

SIGMM

MMAsia '23: ACM Multimedia Asia

December 6 - 8, 2023

Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%
