DMTrack: learning deformable masked visual representations for single object tracking

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Single object tracking remains challenging because it requires localizing an arbitrary object in a sequence of frames given only its appearance in the first frame. Many trackers, especially those built on a Vision Transformer (ViT) backbone, have achieved superior performance; however, the gap between the performance measured on the training data and on the test data is still large. To alleviate this issue, we propose a deformable masking module for transformer-based trackers, injected after each ViT layer. First, it masks out complete vectors of the output representations of the ViT layer. The masked representations are then fed into a deformable convolution to reconstruct new, reliable representations. The output of the last ViT layer is fused with all intermediate outputs of the deformable masking modules to produce a final, robust attentional feature map. We extensively evaluate our model, named DMTrack, on seven tracking benchmarks. It outperforms previous state-of-the-art techniques by \(+\,2\%\) while using \(92.4\%\) fewer parameters, and it matches the performance of much larger models, indicating the effectiveness of our training strategy.
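
To make the mechanism described above concrete, here is a minimal PyTorch sketch of one plausible reading of the deformable masking module. Everything in it (the class and function names, the mask ratio, the offset predictor, and the mean fusion) is an illustrative assumption rather than the authors' released implementation; torchvision's DeformConv2d stands in for the deformable convolution.

```python
# A minimal sketch of the deformable masking idea, assuming ViT tokens
# arranged on an H x W grid. All names and hyperparameters are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableMaskingModule(nn.Module):
    """Masks complete token vectors from one ViT layer, then reconstructs
    them with a deformable convolution over the 2-D token grid."""

    def __init__(self, dim, kernel_size=3, mask_ratio=0.1):
        super().__init__()
        self.mask_ratio = mask_ratio
        pad = kernel_size // 2
        # Predicts two sampling offsets (dy, dx) per kernel element.
        self.offset_conv = nn.Conv2d(dim, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(dim, dim, kernel_size, padding=pad)

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, C) output of one ViT layer, with N == H * W.
        B, N, C = tokens.shape
        H, W = grid_hw
        if self.training:
            # Mask out complete vectors: whole tokens are zeroed at random.
            keep = torch.rand(B, N, 1, device=tokens.device) > self.mask_ratio
            tokens = tokens * keep
        x = tokens.transpose(1, 2).reshape(B, C, H, W)  # token grid
        x = self.deform_conv(x, self.offset_conv(x))    # reconstruct tokens
        return x.flatten(2).transpose(1, 2)             # back to (B, N, C)


def fuse(last_layer_tokens, module_outputs):
    # The abstract says the last ViT layer output is fused with all
    # intermediate module outputs but not how; a simple mean is assumed.
    return torch.stack([last_layer_tokens, *module_outputs], dim=0).mean(dim=0)
```

In a full tracker, one such module would sit after every ViT layer, and fuse would combine their intermediate outputs with the final layer's output to form the attentional feature map that feeds the prediction head.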

Data and materials availability

Not applicable.

Code availability

Not applicable.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

Conceptualization: Omar Abdelaziz, Mohamed Shehata; Methodology: Omar Abdelaziz; Formal analysis and investigation: Omar Abdelaziz; Writing—original draft preparation: Omar Abdelaziz; Writing—review and editing: Omar Abdelaziz, Mohamed Shehata; Supervision: Mohamed Shehata

Corresponding author

Correspondence to Omar Abdelaziz.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval and consent to participate

Not applicable.

Consent for publication

The authors consent to publish this work in Signal, Image and Video Processing.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Abdelaziz, O., Shehata, M. DMTrack: learning deformable masked visual representations for single object tracking. SIViP 19, 61 (2025). https://doi.org/10.1007/s11760-024-03713-0
