Abstract
Single object tracking remains challenging because it requires localizing an arbitrary object across a sequence of frames given only its appearance in the first frame. Many trackers, especially those built on a Vision Transformer (ViT) backbone, have achieved strong performance. However, the gap between performance measured on the training data and on the test data is still large. To alleviate this issue, we propose a deformable masking module for transformer-based trackers. The deformable masking module is injected after each ViT layer. First, it masks out complete token vectors of the layer's output representations. The masked representations are then fed into a deformable convolution, which reconstructs new, reliable representations. Finally, the output of the last ViT layer is fused with the intermediate outputs of all deformable masking modules to produce a robust attentional feature map. We extensively evaluate our model, named DMTrack, on seven tracking benchmarks. DMTrack outperforms previous state-of-the-art techniques by \(+2\%\) while using \(92.4\%\) fewer parameters. Moreover, it matches the performance of models with far more parameters, indicating the effectiveness of our training strategy.
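To make the described pipeline concrete, the following is a minimal PyTorch sketch of one deformable masking module: whole token vectors from a ViT layer's output are zeroed at random, and a deformable convolution (here torchvision's `DeformConv2d`) over the 2-D token grid reconstructs the masked positions from their deformed neighborhoods. The class name, mask ratio, 3x3 kernel, offset-prediction convolution, and training-only masking are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableMaskingModule(nn.Module):
    """Masks complete token vectors of a ViT layer's output, then
    reconstructs them with a deformable convolution over the token grid."""

    def __init__(self, dim: int, mask_ratio: float = 0.1, kernel_size: int = 3):
        super().__init__()
        self.mask_ratio = mask_ratio  # assumed hyperparameter
        pad = kernel_size // 2
        # A small conv predicts the 2-D sampling offsets (2 per kernel tap).
        self.offset_conv = nn.Conv2d(dim, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(dim, dim, kernel_size, padding=pad)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens from one ViT layer, N = H * W.
        B, N, C = tokens.shape
        H, W = grid_hw
        if self.training:  # masking only at training time (our assumption)
            keep = torch.rand(B, N, 1, device=tokens.device) > self.mask_ratio
            tokens = tokens * keep  # zero out whole token vectors
        # Fold the token sequence back into a 2-D feature map.
        fmap = tokens.transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset_conv(fmap)
        recon = self.deform_conv(fmap, offsets)  # reconstructed representation
        return recon.flatten(2).transpose(1, 2)  # back to (B, N, C)


# Toy usage: a 14x14 grid of 768-d tokens, as in ViT-B with 224x224 input.
dmm = DeformableMaskingModule(dim=768)
out = dmm(torch.randn(2, 196, 768), grid_hw=(14, 14))
```

In the full tracker, one such module would follow every ViT layer, and the intermediate reconstructions would be fused with the last layer's output to form the final attentional feature map; the abstract does not specify the fusion operator, so a simple (possibly weighted) sum is a natural placeholder.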
(Figures 1–7 of the article are omitted here; see the online version at the DOI below.)
Data and materials availability
Not applicable.
Code availability
Not applicable.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
Conceptualization: Omar Abdelaziz, Mohamed Shehata; Methodology: Omar Abdelaziz; Formal analysis and investigation: Omar Abdelaziz; Writing—original draft preparation: Omar Abdelaziz; Writing—review and editing: Omar Abdelaziz, Mohamed Shehata; Supervision: Mohamed Shehata
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethical approval and consent to participate
Not applicable.
Consent for publication
The authors consent to publish this work in Signal, Image and Video Processing.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Abdelaziz, O., Shehata, M. DMTrack: learning deformable masked visual representations for single object tracking. SIViP 19, 61 (2025). https://doi.org/10.1007/s11760-024-03713-0