
Refiner: a general object position refinement algorithm for visual tracking

  • Original Article
  • Neural Computing and Applications

Abstract

Object tracking is an important topic in computer vision. Most existing trackers require an accurate initial position of the target; in real applications, however, the initial location may be inaccurate, which can lead to tracking drift. To address this problem, we propose a simple deep learning-based method called Refiner that produces an accurate object position from a rough location. Specifically, we propose an end-to-end position refinement network consisting of a backbone network, a feature enhancement module, a feature fusion module, and a shape predictor; the shape predictor comprises two branches: a bounding box prediction branch and a mask prediction branch. Refiner improves the spatial robustness of existing trackers by correcting inaccurate initial positions, and it can also be applied during tracking to improve the accuracy of subsequent results. Extensive experiments on object tracking benchmarks verify its effectiveness and efficiency.
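To make the described pipeline concrete, below is a minimal PyTorch sketch of a position refinement network with the four components named in the abstract: a backbone, a feature enhancement module, a feature fusion module, and a two-branch shape predictor. Every module choice, layer size, and name here (`enhance`, `fuse`, `bbox_head`, `mask_head`, the ResNet-50 backbone) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of the architecture outlined in the abstract:
# backbone -> feature enhancement -> feature fusion -> shape predictor
# with a bounding-box branch and a mask branch. All modules and sizes
# below are illustrative placeholders, not the paper's actual design.
import torch
import torch.nn as nn
import torchvision.models as models


class Refiner(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Backbone: any standard feature extractor; ResNet-50 is an
        # assumption made for this sketch.
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        # Feature enhancement: placeholder 1x1 conv + BN + ReLU.
        self.enhance = nn.Sequential(
            nn.Conv2d(2048, feat_dim, kernel_size=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )
        # Feature fusion: placeholder 3x3 conv mixing the enhanced features.
        self.fuse = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # Shape predictor, branch 1: refined bounding box (x, y, w, h).
        self.bbox_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 4)
        )
        # Shape predictor, branch 2: per-pixel foreground mask logits.
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, crop: torch.Tensor):
        # `crop` is an image patch cut around the rough target location.
        f = self.fuse(self.enhance(self.backbone(crop)))
        return self.bbox_head(f), self.mask_head(f)


if __name__ == "__main__":
    net = Refiner()
    box, mask = net(torch.randn(1, 3, 224, 224))
    print(box.shape, mask.shape)  # (1, 4) box, (1, 1, 7, 7) mask logits
```

In use, a tracker would crop a search region around its rough estimate, run the refiner, and map the predicted box back to image coordinates; the mask branch would provide a segmentation-level cue for the target's shape.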


Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.


Funding

Funding was provided by Shaanxi Key Research and Development Program (Grant No. 2018ZDCXL-GY-04-03-02).

Author information

Corresponding author

Correspondence to Bo Zhao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, H., Zhao, B. & Liu, G. Refiner: a general object position refinement algorithm for visual tracking. Neural Comput & Applic 36, 3967–3981 (2024). https://doi.org/10.1007/s00521-023-09263-9

