
ExtremeFormer: a new framework for accurate object tracking by designing an efficient head prediction module


Abstract

Object tracking has become a crucial area of research in intelligent perception in recent years. Most current single-object trackers rely on point regression and heatmaps to predict the target position, but their performance degrades when the key points are occluded. To address this issue, we introduce ExtremeFormer, a model that combines a backbone similar to OSTrack with an ENM (Extreme Net Module) head. Our core idea is to use the ENM module to localize the target by regressing the positions of its edges rather than of individual points. Unlike traditional center-based regression methods, ENM predicts the left, top, right, and bottom boundaries of the target bounding box and uses an offset branch to compensate for errors introduced by resolution reduction. This design greatly alleviates tracking failures caused by occlusion of the center point and improves the robustness of the tracker. In addition, our tracker requires no Hanning windows or penalty terms to remain stable during tracking. The final ExtremeFormer model outperforms existing state-of-the-art trackers on four tracking benchmarks: LaSOT, TrackingNet, GOT-10k, and UAV123. Specifically, ExtremeFormer-384 achieves a precision score of 83.1% on TrackingNet, 74.9% on LaSOT, and an AO of 73.9% on GOT-10k. These results demonstrate the effectiveness of the proposed model, which provides a more robust and accurate approach to single-object tracking in challenging environments.
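
To make the boundary-regression idea concrete, the sketch below shows a minimal PyTorch-style head that predicts the four boundaries of the target box plus an offset correction from a search-region feature map. It is only an illustration under our own assumptions (the layer sizes, the soft-argmax decoding, and the single global offset vector are not taken from the paper) and is not the authors' ENM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoundaryHead(nn.Module):
    """Toy boundary-regression head (an illustration, not the paper's ENM).

    Predicts one score map per boundary (left, top, right, bottom) and a
    4-dim offset vector that compensates for the backbone's stride-induced
    resolution reduction.
    """

    def __init__(self, in_channels: int = 256, stride: int = 16):
        super().__init__()
        self.stride = stride
        self.score = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4, 1),  # one score map per boundary
        )
        self.offset = nn.Sequential(       # sub-cell offsets in [0, 1)
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 4),
            nn.Sigmoid(),
        )

    @staticmethod
    def _soft_argmax(dist: torch.Tensor) -> torch.Tensor:
        # dist: (B, N) unnormalized scores along one spatial axis.
        prob = F.softmax(dist, dim=-1)
        coords = torch.arange(dist.shape[-1], device=dist.device, dtype=dist.dtype)
        return (prob * coords).sum(dim=-1)  # (B,) expected coordinate

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        """feat: (B, C, H, W) -> boxes (B, 4) as (x1, y1, x2, y2) in input pixels."""
        s = self.score(feat)                              # (B, 4, H, W)
        off = self.offset(feat)                           # (B, 4)
        left = self._soft_argmax(s[:, 0].mean(dim=1))     # marginal over rows -> x
        top = self._soft_argmax(s[:, 1].mean(dim=2))      # marginal over cols -> y
        right = self._soft_argmax(s[:, 2].mean(dim=1))
        bottom = self._soft_argmax(s[:, 3].mean(dim=2))
        grid = torch.stack([left, top, right, bottom], dim=1)  # coords in feature cells
        return (grid + off) * self.stride                 # back to input resolution


if __name__ == "__main__":
    head = BoundaryHead(in_channels=256, stride=16)
    feat = torch.randn(2, 256, 24, 24)   # e.g. a 384x384 search region at stride 16
    print(head(feat).shape)              # torch.Size([2, 4])
```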


Data availability

Data will be made available on request.

References

  1. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 8971–8980 (2018). https://doi.org/10.1109/CVPR.2018.00935. http://openaccess.thecvf.com/content_cvpr_2018/html/Li_High_Performance_Visual_CVPR_2018_paper.html

  2. Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.S.: Fast online object tracking and segmentation: a unifying approach. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 1328–1338 (2019). https://doi.org/10.1109/CVPR.2019.00142. http://openaccess.thecvf.com/content_CVPR_2019/html/Wang_Fast_Online_Object_Tracking_and_Segmentation_A_Unifying_Approach_CVPR_2019_paper.html

  3. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: Siamrpn++: evolution of siamese visual tracking with very deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 4282–4291 (2019). https://doi.org/10.1109/CVPR.2019.00441. http://openaccess.thecvf.com/content_CVPR_2019/html/Li_SiamRPN_Evolution_of_Siamese_Visual_Tracking_With_Very_Deep_Networks_CVPR_2019_paper.html

  4. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019, pp. 6181–6190 (2019). https://doi.org/10.1109/ICCV.2019.00628

  5. Guo, D., Wang, J., Cui, Y., Wang, Z., Chen, S.: Siamcar: Siamese fully convolutional classification and regression for visual tracking. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp. 6268–6276 (2020). https://doi.org/10.1109/CVPR42600.2020.00630

  6. Zhao, M., Okada, K., Inaba, M.: Trtr: Visual tracking with transformer. (2021). arXiv:2105.03817

  7. Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: International Conference on Computer Vision (2021)

  8. Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: a one-stream framework. In: European Conference on Computer Vision. Springer, pp. 341–357 (2022)

  9. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp. 5998–6008 (2017). https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

  10. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: keypoint triplets for object detection (2019). arXiv:1904.08189

  11. Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: Lasot: A high-quality benchmark for large-scale single object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 5374–5383 (2019). https://doi.org/10.1109/CVPR.2019.00552. http://openaccess.thecvf.com/content_CVPR_2019/html/Fan_LaSOT_A_High-Quality_Benchmark_for_Large-Scale_Single_Object_Tracking_CVPR_2019_paper.html

  12. Müller, M.A., Bibi, A., Giancola, S., Al-Subaihi, S., Ghanem, B.: Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In: European Conference on Computer Vision (2018)

  13. Huang, L., Zhao, X., Huang, K.: Got-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1562–1577 (2022)


  14. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: European Conference on Computer Vision (2016)

  15. Chen, Z., Zhong, B., Li, G., Zhang, S., Ji, R.: Siamese box adaptive network for visual tracking. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp. 6667–6676 (2020). https://doi.org/10.1109/CVPR42600.2020.00670

  16. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth \(16\times 16\) words: transformers for image recognition at scale (2020). arXiv:2010.11929

  17. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision (2021)

  18. Ma, F., Shou, M.Z., Zhu, L., Fan, H., Xu, Y., Yang, Y., Yan, Z.: Unified transformer tracker for object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8781–8790 (2022)

  19. Lin, L., Fan, H., Xu, Y., Ling, H.: Swintrack: A simple and strong baseline for transformer tracking. (2021). arXiv:2112.00995

  20. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: European Conference on Computer Vision (2016)

  21. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems vol. 28 (2015)

  22. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: European Conference on Computer Vision (2018)

  23. Xu, Y., Wang, Z., Li, Z., Yuan, Y., Yu, G.: Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In: National Conference on Artificial Intelligence (2020)

  24. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 4660–4669 (2019). https://doi.org/10.1109/CVPR.2019.00479. http://openaccess.thecvf.com/content_CVPR_2019/html/Danelljan_ATOM_Accurate_Tracking_by_Overlap_Maximization_CVPR_2019_paper.html

  25. Zhou, X., Zhuo, J., Krähenbühl, P.: Bottom-up object detection by grouping extreme and center points. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 850–859 (2019). https://doi.org/10.1109/CVPR.2019.00094. http://openaccess.thecvf.com/content_CVPR_2019/html/Zhou_Bottom-Up_Object_Detection_by_Grouping_Extreme_and_Center_Points_CVPR_2019_paper.html

  26. Chen, G., Qin, H.: Class-discriminative focal loss for extreme imbalanced multiclass object detection towards autonomous driving. Vis. Comput. 38(3), 1051–1063 (2022)


  27. Amirkhani, A., Karimi, M.P.: Adversarial defenses for object detectors based on Gabor convolutional layers. Vis. Comput. 38(6), 1929–1944 (2022)


  28. An, F.-P., Liu, J.-E., Bai, L.: Object recognition algorithm based on optimized nonlinear activation function-global convolutional neural network. Vis. Comput. 38, 541–553 (2022)


  29. Dong, X., Shen, J., Wang, W., Shao, L., Ling, H., Porikli, F.: Dynamical hyperparameter optimization via deep reinforcement learning in tracking. IEEE Trans. Pattern Anal. Mach. Intell. 43(5), 1515–1529 (2019)


  30. Dong, X., Shen, J., Yu, D., Wang, W., Liu, J., Huang, H.: Occlusion-aware real-time object tracking. IEEE Trans. Multimedia 19(4), 763–771 (2016)


  31. Yin, J., Wang, W., Meng, Q., Yang, R., Shen, J.: A unified object motion and affinity model for online multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6768–6777 (2020)

  32. Tang, H., Li, Z., Peng, Z., Tang, J.: Blockmix: meta regularization and self-calibrated inference for metric-based meta-learning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 610–618 (2020)

  33. Tang, H., Yuan, C., Li, Z., Tang, J.: Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 130, 108792 (2022)


  34. Zha, Z., Tang, H., Sun, Y., Tang, J.: Boosting few-shot fine-grained recognition with background suppression and foreground alignment. IEEE Trans. Circuits Syst. Video Technol. (2023). https://doi.org/10.1109/TCSVT.2023.3236636


  35. Li, Z., Tang, H., Peng, Z., Qi, G.-J., Tang, J.: Knowledge-guided semantic transfer network for few-shot image recognition. IEEE Trans. Neural Netw. Learn. Syst. (2023). https://doi.org/10.1109/TNNLS.2023.3240195


  36. Wang, D., Liu, J., Liu, R., Fan, X.: An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. Inf. Fusion 98, 101828 (2023)


  37. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners (2021). arXiv:2111.06377

  38. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. Int. J. Comput. Vis. 128, 642–656 (2018)


  39. Lin, M., Chen, Q., Yan, S.: Network in network. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings (2014). http://arxiv.org/abs/1312.4400

  40. Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.S.: Unitbox: an advanced object detection network. In: Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15–19, 2016, pp. 516–520 (2016). https://doi.org/10.1145/2964284.2967274

  41. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I.D., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 658–666 (2019). https://doi.org/10.1109/CVPR.2019.00075. http://openaccess.thecvf.com/content_CVPR_2019/html/Rezatofighi_Generalized_Intersection_Over_Union_A_Metric_and_a_Loss_for_CVPR_2019_paper.html

  42. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: faster and better learning for bounding box regression. In: National Conference on Artificial Intelligence (2020)

  43. Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q., Zuo, W.: Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52(8), 8574–8586 (2021)

  44. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: common objects in context. In: European Conference on Computer Vision (2014)

  45. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423

  46. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019 (2019). https://openreview.net/forum?id=Bkg6RiCqY7

  47. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 6931–6939 (2017). https://doi.org/10.1109/CVPR.2017.733

  48. Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Know your surroundings: exploiting scene information for object tracking. In: European Conference on Computer Vision (2020)

  49. Mayer, C., Danelljan, M., Paudel, D.P., Gool, L.V.: Learning target candidate association to keep track of what not to track. (2021). arXiv:2103.16556

  50. Mayer, C., Danelljan, M., Bhat, G., Paul, M., Paudel, D.P., Yu, F., Van Gool, L.: Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8731–8740 (2022)


Acknowledgements

We would like to acknowledge the use of ChatGPT, a language model developed by OpenAI, for the language proofreading of this manuscript.

Author information


Corresponding author

Correspondence to Chao Zhang.

Ethics declarations

Conflict of interest

The author declares no conflict of interest in relation to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Supplementary material

This appendix provides additional details of our study, including the experimental hardware environment, the model configurations, and further results.

1.1 Experimental environment

Our experiments are run on a platform equipped with an NVIDIA A5000 GPU and an AMD EPYC 7543 32-core processor. To evaluate tracking performance, we conduct a comprehensive test using the open-source GOT-10k evaluation toolkit.
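
As a point of reference, the snippet below sketches the average-overlap (AO) computation that underlies the GOT-10k protocol. The box format and the toy three-frame sequence are our own illustrative assumptions; the official toolkit should be used for actual reporting.

```python
import numpy as np


def iou(box_a, box_b) -> float:
    """IoU of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred_boxes, gt_boxes) -> float:
    """Mean IoU over the frames of one sequence (AO averages this over sequences)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(overlaps))


# Example: a 3-frame sequence with ground-truth and predicted (x, y, w, h) boxes.
gt = [(10, 10, 50, 40), (12, 11, 50, 40), (15, 12, 50, 40)]
pred = [(11, 10, 48, 42), (14, 13, 49, 39), (20, 18, 52, 38)]
print(f"sequence AO: {average_overlap(pred, gt):.3f}")
```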

1.2 Model details

Model parameters. This work presents two ExtremeFormer tracker models of different sizes. We measure their MACs, parameter counts, and FPS and compare them with two other tracking models, OSTrack-256 and OSTrack-384. As the comparison in Table 4 shows, the ExtremeFormer-320 model has the fewest parameters. A practical real-time single-object tracker must maintain high tracking speed while improving tracking precision, so these models should be evaluated on both speed and accuracy metrics.

Table 4 Comparison of MACs, Params, and FPS of ExtremeFormer tracker models based on different input sizes
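
For readers who wish to reproduce numbers of this kind, the snippet below shows one common way to count parameters and time FPS for a PyTorch model. The placeholder network, input resolution, and iteration counts are assumptions for illustration only and do not correspond to the actual ExtremeFormer code; MACs would additionally require a profiling tool.

```python
import time

import torch
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def measure_fps(model: nn.Module, inp: torch.Tensor, iters: int = 100, warmup: int = 10) -> float:
    model.eval()
    for _ in range(warmup):           # warm-up runs are excluded from timing
        model(inp)
    if inp.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(inp)
    if inp.is_cuda:
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)


# Placeholder network standing in for a tracker; replace with the real model.
model = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 4),
)
x = torch.randn(1, 3, 320, 320)       # e.g. an ExtremeFormer-320-sized search region
print(f"Params: {count_params(model) / 1e6:.2f} M, FPS: {measure_fps(model, x):.1f}")
```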
Fig. 5 State-of-the-art comparison on the UAV123 dataset

Fig. 6 State-of-the-art comparison on the LaSOT dataset

1.3 More visualizations

To confirm the performance of our model, we conducted several test experiments and provide the resulting plots in this section. In Fig. 5, we compare our tracking model with current state-of-the-art models on the UAV123 dataset. Our model outperforms the second-place KeepTrack by 1.3% in precision and OSTrack by 1.6% in success, reaching state-of-the-art performance on both metrics. Similarly, in Fig. 6, we compare our model with current state-of-the-art models on the LaSOT dataset, where it also achieves state-of-the-art precision and success.

Fig. 7 Visualization of tracking sequences. The left column shows the first frame of each sequence, the middle column shows the tracking results of the baseline tracker, and the right column shows the results of our tracker. The green box denotes the ground-truth bounding box, and the red box denotes the tracking result

The goal of this study is to evaluate the tracking accuracy of different trackers through a comprehensive analysis of multiple sequences. To this end, we compare the tracking results of the baseline tracker with those of our approach. As illustrated in Fig. 7, the middle column shows the results of the baseline tracker, while the right column shows those of our tracker. The proposed model outperforms the baseline, particularly in scenes with complex occlusion. This improvement in tracking accuracy can be attributed to the boundary-based regression described above, and the results indicate that our approach handles challenging scenarios more effectively than traditional trackers.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, C. ExtremeFormer: a new framework for accurate object tracking by designing an efficient head prediction module. Vis Comput 40, 2961–2974 (2024). https://doi.org/10.1007/s00371-023-02997-6
