Abstract
Object tracking is an important topic in computer vision. Most existing trackers require an accurate initial position of the target. In real applications, however, the initial location may be inaccurate, which can cause tracking drift. To address this problem, we propose a simple deep-learning-based method, called Refiner, that produces an accurate object position given a rough location. Specifically, we design an end-to-end position refinement network consisting of a backbone network, a feature enhancement module, a feature fusion module, and a shape predictor; the shape predictor has two branches, one for bounding-box prediction and one for mask prediction. Correcting an inaccurate initial position improves the spatial robustness of existing trackers. The proposed method can also be applied during the tracking process to improve the accuracy of subsequent results. Extensive experiments on object tracking benchmarks verify its effectiveness and efficiency.
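The mask branch of the shape predictor implies a simple geometric step for recovering a corrected bounding box: take the tight box around the predicted segmentation. The sketch below illustrates only that step; the function name `tight_box` and the list-of-rows mask format are our own illustrative assumptions, not the paper's API, and the paper's predictor is a trained network rather than this stub.

```python
def tight_box(mask):
    """Tight (x, y, w, h) box around a binary mask given as a list of rows.

    Illustrative only: stands in for the step that turns the mask branch's
    output into a refined bounding box. Returns None for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    x, y = min(cols), min(rows)
    return (x, y, max(cols) - x + 1, max(rows) - y + 1)

# A rough initial box can then be replaced by the refined one:
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(tight_box(mask))  # (1, 1, 3, 2)
```

In the paper's setting, the same correction is applied both to the (possibly inaccurate) initial annotation and, optionally, to each frame's tracking output.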
Figs. 1–12 accompany the article; see the published version at https://doi.org/10.1007/s00521-023-09263-9.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Funding
Funding was provided by Shaanxi Key Research and Development Program (Grant No. 2018ZDCXL-GY-04-03-02).
About this article
Cite this article
Wu, H., Zhao, B. & Liu, G. Refiner: a general object position refinement algorithm for visual tracking. Neural Comput & Applic 36, 3967–3981 (2024). https://doi.org/10.1007/s00521-023-09263-9