Multi-granularity Feature Fusion for Transformer-Based Single Object Tracking

  • Conference paper

Rough Sets (IJCRS 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14481)

Abstract

The recently developed transformer has been widely explored in computer vision and has notably improved the performance of single object tracking. However, most current efforts concentrate on combining and enhancing features generated by convolutional neural networks (CNNs), and cannot fully exploit the potential of the transformer. Motivated by this, we introduce multi-granularity theory into a pure transformer-based single object tracker and design a multi-granularity feature fusion module. To fuse features of different granularities and enhance the feature representation, we design a double-branch transformer feature extractor and use a cross-attention mechanism to fuse the features. Extensive experiments on multiple tracking benchmarks, including OTB2015, VOT2020, TrackingNet, GOT-10k, and LaSOT, demonstrate that the proposed tracker, named MGTT, achieves better performance than multiple state-of-the-art trackers.
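
The chapter body is paywalled in this preview, so only the abstract's description of the architecture is available. As a loose illustration of what cross-attention fusion between two feature granularities can look like, here is a minimal PyTorch sketch; the module name, dimensions, and layer choices are assumptions of ours, not the authors' MGTT implementation:

```python
import torch
import torch.nn as nn

class CrossGranularityFusion(nn.Module):
    """Hypothetical sketch: fuse a fine-granularity and a coarse-granularity
    token stream via cross-attention (not the authors' MGTT code)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Each branch queries the other branch's tokens.
        self.fine_attends_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_attends_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_fine = nn.LayerNorm(dim)
        self.norm_coarse = nn.LayerNorm(dim)

    def forward(self, fine, coarse):
        # fine:   (B, N_fine, dim)   tokens from a small-patch branch
        # coarse: (B, N_coarse, dim) tokens from a large-patch branch
        f2c, _ = self.fine_attends_coarse(query=fine, key=coarse, value=coarse)
        c2f, _ = self.coarse_attends_fine(query=coarse, key=fine, value=fine)
        fine = self.norm_fine(fine + f2c)        # residual connection + norm
        coarse = self.norm_coarse(coarse + c2f)
        return fine, coarse

# Toy usage: batch of 2, with 64 fine tokens and 16 coarse tokens of width 256.
fusion = CrossGranularityFusion()
f, c = fusion(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
```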



Acknowledgements

This work is supported in part by the National Key Research and Development Plan under Grant No. 2022YFB3104700 and by the National Natural Science Foundation of China under Grants No. 61976158, No. 62376198, No. 62076182, No. 62163016, and No. 62006172. It is also partially supported by the Jiangxi "Double Thousand Plan" and by the Jiangxi Provincial Natural Science Foundation under Grant No. 20212ACB202001.

Author information

Corresponding author

Correspondence to Ziye Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Z., Miao, D. (2023). Multi-granularity Feature Fusion for Transformer-Based Single Object Tracking. In: Campagner, A., Urs Lenz, O., Xia, S., Ślęzak, D., Wąs, J., Yao, J. (eds) Rough Sets. IJCRS 2023. Lecture Notes in Computer Science, vol. 14481. Springer, Cham. https://doi.org/10.1007/978-3-031-50959-9_22

  • DOI: https://doi.org/10.1007/978-3-031-50959-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-50958-2

  • Online ISBN: 978-3-031-50959-9

  • eBook Packages: Computer Science, Computer Science (R0)
