DOI: 10.1145/3595916.3626410
research-article

Robust Tracking via Unifying Pretrain-Finetuning and Visual Prompt Tuning

Published: 01 January 2024

Abstract

The finetuning paradigm is widely used for the supervised training of top-performing trackers. However, it faces one key issue: it is unclear how best to finetune a pretrained model for tracking tasks while alleviating catastrophic forgetting. To address this problem, we propose a novel partial finetuning paradigm for visual tracking that unifies pretrain-finetuning and visual prompt tuning (named UPVPT). It not only learns efficiently from the tracking task but also reuses the prior knowledge of the pretrained model to handle the various challenges of tracking. First, to maintain the pretrained prior knowledge, we design a prompt-style method that freezes some parameters of the pretrained network. Then, to learn knowledge from the tracking task, we update the parameters of the prompt and MLP layers. As a result, we can not only retain useful prior knowledge of the pretrained model by freezing the backbone network but also effectively learn target-domain knowledge by updating the prompt and MLP layers. Furthermore, the proposed UPVPT can easily be embedded into existing Transformer trackers (e.g., OSTracker and SwinTracker) by adding only a small number of model parameters (less than 1% of a backbone network). Extensive experiments on five tracking benchmarks (i.e., UAV123, GOT-10k, LaSOT, TNL2K, and TrackingNet) demonstrate that the proposed UPVPT can improve the robustness and effectiveness of the model, especially in complex scenarios.
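
The partial finetuning idea in the abstract (freeze most of the backbone, update only the prompt and MLP parameters) can be illustrated with a small parameter-bookkeeping sketch. This is not the authors' implementation: the parameter names, the ViT-Base-like layer sizes, the prompt length of 8 tokens per block, and the rule of selecting trainable tensors by the substrings `prompt` and `mlp` are all assumptions made for illustration. In a framework such as PyTorch, the tensors placed in `frozen` would typically have `requires_grad` set to `False`.

```python
# Hypothetical sketch: all names, sizes, and the prompt length are illustrative
# assumptions, not values taken from the paper.
EMBED_DIM, HIDDEN_DIM, NUM_BLOCKS, PROMPT_TOKENS = 768, 3072, 12, 8


def build_param_sizes():
    """Return {parameter name: element count} for a toy ViT-like backbone plus prompts."""
    sizes = {}
    for i in range(NUM_BLOCKS):
        # Attention: fused qkv projection and output projection (weights + biases).
        sizes[f"blocks.{i}.attn.qkv"] = EMBED_DIM * 3 * EMBED_DIM + 3 * EMBED_DIM
        sizes[f"blocks.{i}.attn.proj"] = EMBED_DIM * EMBED_DIM + EMBED_DIM
        # Feed-forward MLP: two linear layers (weights + biases).
        sizes[f"blocks.{i}.mlp.fc1"] = EMBED_DIM * HIDDEN_DIM + HIDDEN_DIM
        sizes[f"blocks.{i}.mlp.fc2"] = HIDDEN_DIM * EMBED_DIM + EMBED_DIM
        # Two layer norms, each with a scale and a shift vector.
        sizes[f"blocks.{i}.norms"] = 2 * 2 * EMBED_DIM
        # Newly added learnable prompt tokens for this block (assumed design).
        sizes[f"prompt.{i}"] = PROMPT_TOKENS * EMBED_DIM
    return sizes


def split_parameters(sizes, trainable_keywords=("prompt", "mlp")):
    """Partition parameters into trainable and frozen sets by substring match on the name."""
    trainable = {n: s for n, s in sizes.items() if any(k in n for k in trainable_keywords)}
    frozen = {n: s for n, s in sizes.items() if n not in trainable}
    return trainable, frozen


def added_fraction(sizes):
    """Ratio of newly added prompt parameters to the original backbone parameters."""
    added = sum(s for n, s in sizes.items() if n.startswith("prompt"))
    backbone = sum(s for n, s in sizes.items() if not n.startswith("prompt"))
    return added / backbone


sizes = build_param_sizes()
trainable, frozen = split_parameters(sizes)
print(f"trainable tensors: {len(trainable)}, frozen tensors: {len(frozen)}")
print(f"added prompt parameters / backbone parameters: {added_fraction(sizes):.4%}")
```

Under these assumed sizes, the added prompt parameters are well below 1% of the toy backbone, which matches the order of magnitude the abstract reports. The abstract does not say which MLP layers are updated, so the substring rule here is only one possible reading.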

    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN:9798400702051
    DOI:10.1145/3595916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Object Tracking
    2. Pretrain-finetuning
    3. Prompt

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MMAsia '23
    Sponsor:
    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%
