Abstract
Object tracking has become a crucial area of research in intelligent perception in recent years. Most mainstream single-object trackers predict the target's position through point regression and heatmaps, but the performance of these models degrades when key points are occluded. To address this issue, we introduce the ExtremeFormer model, which combines a backbone similar to OSTrack with an ENM (Extreme Net Module) head. Our core idea is to use the ENM module to localize the target by regressing the positions of its edges instead of individual points. Unlike traditional center-based regression methods, ENM predicts the left, top, right, and bottom boundaries of the target bounding box and uses an offset branch to compensate for errors caused by resolution reduction. This greatly alleviates tracking failures caused by occlusion of the center point and improves the robustness of the tracker. In addition, our tracker requires neither Hanning windows nor penalty terms to remain stable during tracking. Our final ExtremeFormer model outperforms existing state-of-the-art trackers on four tracking benchmarks: LaSOT, TrackingNet, GOT-10k, and UAV123. Specifically, ExtremeFormer-384 achieves a precision score of 83.1% on TrackingNet, 74.9% on LaSOT, and an AO of 73.9% on GOT-10k. These results demonstrate the effectiveness of the proposed model, which offers a more robust and accurate approach to single-object tracking in challenging environments.
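The boundary-plus-offset decoding described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the map shapes, the stride of 16, and the function name `decode_box` are hypothetical.

```python
import numpy as np

def decode_box(boundary_maps, offset_maps, stride=16):
    """Decode a bounding box from four boundary score maps.

    boundary_maps: float array (4, H, W) -- scores for the left, top,
        right, and bottom extreme points on the stride-reduced grid.
    offset_maps: float array (4, 2, H, W) -- per-cell (dx, dy) corrections
        that compensate for the resolution reduction.
    Returns (x1, y1, x2, y2) in input-image pixel coordinates.
    """
    coords = []
    for i in range(4):
        # Most confident grid cell for this boundary.
        idx = np.argmax(boundary_maps[i])
        cy, cx = np.unravel_index(idx, boundary_maps[i].shape)
        # Sub-cell offset predicted at that location.
        dx, dy = offset_maps[i, :, cy, cx]
        coords.append(((cx + dx) * stride, (cy + dy) * stride))
    (lx, _), (_, ty), (rx, _), (_, by) = coords
    return lx, ty, rx, by
```

Because each side of the box is located independently, occluding the object's center does not by itself destroy the prediction, which matches the robustness argument above.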
Data availability
Data will be made available on request.
Acknowledgements
We would like to acknowledge the use of ChatGPT, a language model developed by OpenAI, for the language proofreading of this manuscript.
Ethics declarations
Conflict of interest
The author declares no conflict of interest in relation to the content of this article.
Appendix: Supplementary material
This appendix provides additional details on our study, including the model configuration, the experimental hardware environment, and further results.
1.1 Experimental environment
Our experiments run on an NVIDIA A5000 GPU platform equipped with an AMD EPYC 7543 32-core processor. To evaluate the performance of our model, we conducted a comprehensive tracking test using the GOT-10k open-source evaluation toolkit.
1.2 Model details
Model parameters This work presents two ExtremeFormer trackers of different sizes. We compute and test their MACs, parameter counts, and FPS, and compare them with two other tracking models, OSTrack-256 and OSTrack-384. From the comparison in Table 4, ExtremeFormer-320 has the fewest parameters. A practical real-time single-object tracker must maintain high tracking speed while improving tracking precision, so these models should be evaluated on both speed and accuracy metrics.
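As a hedged illustration of how the MACs and parameter counts of the kind reported in Table 4 are typically obtained, the per-layer arithmetic for a convolution can be written out directly. The helper name `conv2d_stats` and the convention of one multiply-accumulate per weight per output location are our assumptions, not taken from the paper:

```python
def conv2d_stats(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameter and MAC counts for one k x k Conv2d layer.

    Summing these per-layer figures over a whole network gives the
    model-level Params and MACs columns used in tracker comparisons.
    """
    params = c_out * c_in * k * k + (c_out if bias else 0)
    # One multiply-accumulate per weight per output location.
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# Example: a ResNet-style 7x7 stem conv on a 224x224 input (112x112 output).
params, macs = conv2d_stats(3, 64, 7, 112, 112)
```

FPS, by contrast, is measured empirically by timing repeated forward passes on the target hardware after a warm-up phase, which is why it depends on the GPU used while MACs and Params do not.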
1.3 More visualizations
To further confirm the performance of our model, we conducted several additional test experiments and provide the resulting plots in this section. In Fig. 5, we compare our tracking model with current state-of-the-art models on the UAV123 dataset: it outperforms the second-place KeepTrack by 1.3% in precision and OSTrack by 1.6% in success, reaching state-of-the-art levels in both metrics. Similarly, Fig. 6 compares our tracking model with current state-of-the-art models on the LaSOT dataset, where it also achieves state-of-the-art precision and success.
We also evaluate the tracking accuracy of different trackers through a qualitative analysis of multiple sequences, comparing the results of the baseline tracker with our approach. As illustrated in Fig. 7, the left figure shows the tracking results of the baseline tracker, while the right figure shows our approach. The proposed model outperforms the baseline tracking model, particularly in complex occlusion scenes. This improvement in tracking accuracy can be attributed to the boundary-based regression methodology employed, and the experimental results indicate that our approach is more effective than traditional trackers under challenging scenarios.
About this article
Cite this article
Zhang, C. ExtremeFormer: a new framework for accurate object tracking by designing an efficient head prediction module. Vis Comput 40, 2961–2974 (2024). https://doi.org/10.1007/s00371-023-02997-6