Abstract
The discriminative model prediction (DiMP) tracker is an excellent end-to-end tracking framework and achieved state-of-the-art results at the time of its publication. However, DiMP exhibits two problems in practice: (1) it is prone to interference from objects similar to the target during tracking, and (2) it requires a large amount of labeled data for training. In this paper, we propose two methods to enhance robustness to interference from similar objects in target tracking: multi-scale region search and Gaussian convolution-based response map processing. To tackle the need for large amounts of labeled training data, we implement self-supervised training based on forward-backward tracking for the DiMP tracking method. Furthermore, a new consistency loss function is designed to better support self-supervised training. Extensive experiments show that these enhancements to the DiMP tracking framework bolster its robustness, and that the tracker trained in a self-supervised manner delivers outstanding tracking performance.
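To make two of the abstract's ideas concrete, the sketch below illustrates (a) smoothing a tracker's response map with a Gaussian kernel so that isolated distractor peaks are suppressed relative to the broad true-target peak, and (b) a forward-backward consistency penalty that compares the starting box with the box recovered after tracking forward and then backward through a clip. This is a minimal illustration under our own simplifying assumptions (NumPy arrays, a mean-squared-error penalty); it is not the paper's actual loss or implementation, and the function names are hypothetical.

```python
import numpy as np


def smooth_response(response, sigma=1.0):
    """Convolve a 2-D response map with a Gaussian kernel.

    Sharp single-pixel distractor peaks are attenuated more than the
    broad peak produced by the true target.
    """
    radius = int(3 * sigma)
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    kernel /= kernel.sum()  # normalize so overall response energy is preserved

    # Plain sliding-window convolution with edge padding (no SciPy needed).
    padded = np.pad(response, radius, mode="edge")
    out = np.zeros_like(response, dtype=float)
    h, w = response.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            out[i, j] = np.sum(window * kernel)
    return out


def forward_backward_consistency(start_box, fb_box):
    """Toy consistency penalty: mean squared error between the initial
    box and the box obtained after forward-then-backward tracking.
    For a perfectly consistent tracker the penalty is zero."""
    start_box = np.asarray(start_box, dtype=float)
    fb_box = np.asarray(fb_box, dtype=float)
    return float(np.mean((start_box - fb_box) ** 2))
```

For example, a response map with a single peak keeps its maximum at the same location after smoothing, while `forward_backward_consistency([0, 0, 10, 10], [0, 0, 10, 10])` returns `0.0`, reflecting a tracker that returned exactly to its starting box.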
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This research was supported by the National Natural Science Foundation of China under Grant Nos. 62202362, 62302073 and 62172126, by the Guangzhou Key Laboratory of Scene Understanding and Intelligent Interaction under Grant No. 202201000001, by the Fundamental Research Funds for the Central Universities under Grant No. XJS222503, by the China Postdoctoral Science Foundation under Grant Nos. 2022TQ0247 and 2023M742742, and by the Science and Technology Projects in Guangzhou under Grant No. 2023A04J0397.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yuan, D., Geng, G., Shu, X. et al. Self-supervised discriminative model prediction for visual tracking. Neural Comput & Applic 36, 5153–5164 (2024). https://doi.org/10.1007/s00521-023-09348-5