Abstract
Object tracking has become a crucial area of research in intelligent perception in recent years. Most mainstream single-object trackers predict the target's position through point regression and heatmaps, but the performance of these models degrades when key points are occluded. To address this issue, we introduce the ExtremeFormer model, which combines a backbone similar to OSTrack with an ENM (Extreme Net Module) head. Our core idea is to use the ENM module to localize the target by regressing the positions of its edges instead of individual points. Unlike traditional center-based regression methods, ENM predicts the left, top, right, and bottom boundaries of the target bounding box and uses an offset branch to compensate for errors caused by resolution reduction. This greatly alleviates tracking failures caused by occlusion of the center point and improves the robustness of the tracker. In addition, our tracker requires neither Hanning windows nor penalty terms to remain stable during tracking. Our final ExtremeFormer model outperforms existing state-of-the-art trackers on four tracking benchmarks: LaSOT, TrackingNet, GOT-10k, and UAV123. Specifically, ExtremeFormer-384 achieves a precision score of 83.1% on TrackingNet, 74.9% on LaSOT, and an AO of 73.9% on GOT-10k. These results demonstrate the effectiveness of the proposed model, which offers a more robust and accurate approach to single-object tracking in challenging environments.
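The boundary-plus-offset decoding described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the map shapes, the stride of 16, and the function name `decode_box` are hypothetical.

```python
import numpy as np

def decode_box(boundary_maps, offset_maps, stride=16):
    """Decode a bounding box from four boundary score maps.

    boundary_maps: float array (4, H, W) -- scores for the left, top,
        right, and bottom extreme points on the stride-reduced grid.
    offset_maps: float array (4, 2, H, W) -- per-cell (dx, dy) corrections
        that compensate for the resolution reduction.
    Returns (x1, y1, x2, y2) in input-image pixel coordinates.
    """
    coords = []
    for i in range(4):
        # Most confident grid cell for this boundary.
        idx = np.argmax(boundary_maps[i])
        cy, cx = np.unravel_index(idx, boundary_maps[i].shape)
        # Sub-cell offset predicted at that location.
        dx, dy = offset_maps[i, :, cy, cx]
        coords.append(((cx + dx) * stride, (cy + dy) * stride))
    (lx, _), (_, ty), (rx, _), (_, by) = coords
    return lx, ty, rx, by
```

Because each side of the box is located independently, occluding the object's center does not by itself destroy the prediction, which matches the robustness argument above.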
Data availability
Data will be made available on request.
Acknowledgements
We would like to acknowledge the use of ChatGPT, a language model developed by OpenAI, for the language proofreading of this manuscript.
Ethics declarations
Conflict of interest
The author declares no conflict of interest in relation to the content of this article.
Appendix: Supplementary material
This appendix provides additional details on our study, including the model configuration, the experimental hardware environment, and further results.
1.1 Experimental environment
Our experiments run on an NVIDIA A5000 GPU platform equipped with an AMD EPYC 7543 32-core processor. To evaluate the performance of our model, we conducted a comprehensive tracking test using the GOT-10k open-source evaluation toolkit.
1.2 Model details
Model parameters This work presents two ExtremeFormer trackers of different sizes. We compute and test their MACs, parameter counts, and FPS, and compare them with two other tracking models, OSTrack-256 and OSTrack-384. From the comparison in Table 4, ExtremeFormer-320 has the fewest parameters. A practical real-time single-object tracker must maintain high tracking speed while improving tracking precision, so these models should be evaluated on both speed and accuracy metrics.
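As a hedged illustration of how the MACs and parameter counts of the kind reported in Table 4 are typically obtained, the per-layer arithmetic for a convolution can be written out directly. The helper name `conv2d_stats` and the convention of one multiply-accumulate per weight per output location are our assumptions, not taken from the paper:

```python
def conv2d_stats(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameter and MAC counts for one k x k Conv2d layer.

    Summing these per-layer figures over a whole network gives the
    model-level Params and MACs columns used in tracker comparisons.
    """
    params = c_out * c_in * k * k + (c_out if bias else 0)
    # One multiply-accumulate per weight per output location.
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# Example: a ResNet-style 7x7 stem conv on a 224x224 input (112x112 output).
params, macs = conv2d_stats(3, 64, 7, 112, 112)
```

FPS, by contrast, is measured empirically by timing repeated forward passes on the target hardware after a warm-up phase, which is why it depends on the GPU used while MACs and Params do not.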
1.3 More visualizations
To further confirm the performance of our model, we conducted several additional test experiments and provide the resulting plots in this section. In Fig. 5, we compare our tracking model with current state-of-the-art models on the UAV123 dataset: it outperforms the second-place KeepTrack by 1.3% in precision and OSTrack by 1.6% in success, reaching state-of-the-art levels in both metrics. Similarly, Fig. 6 compares our tracking model with current state-of-the-art models on the LaSOT dataset, where it also achieves state-of-the-art precision and success.
We also evaluate the tracking accuracy of different trackers through a qualitative analysis of multiple sequences, comparing the results of the baseline tracker with our approach. As illustrated in Fig. 7, the left figure shows the tracking results of the baseline tracker, while the right figure shows our approach. The proposed model outperforms the baseline tracking model, particularly in complex occlusion scenes. This improvement in tracking accuracy can be attributed to the boundary-based regression methodology employed, and the experimental results indicate that our approach is more effective than traditional trackers under challenging scenarios.
About this article
Cite this article
Zhang, C. ExtremeFormer: a new framework for accurate object tracking by designing an efficient head prediction module. Vis Comput 40, 2961–2974 (2024). https://doi.org/10.1007/s00371-023-02997-6