Multi-granularity Feature Fusion for Transformer-Based Single Object Tracking

  • Conference paper

Rough Sets (IJCRS 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14481)

Abstract

The recently developed transformer has been widely explored in computer vision and has notably improved the performance of single object tracking. However, most current efforts concentrate on combining and enhancing features generated by convolutional neural networks (CNNs), and cannot fully exploit the potential of the transformer. Motivated by this, we introduce multi-granularity theory into a pure transformer-based single object tracker and design a multi-granularity feature fusion module. To fuse features of different granularities and enhance the feature representation, we design a double-branch transformer feature extractor and use a cross-attention mechanism to fuse the features. Extensive experiments on multiple tracking benchmarks, including OTB2015, VOT2020, TrackingNet, GOT-10k, and LaSOT, demonstrate that the proposed tracker, named MGTT, achieves better performance than multiple state-of-the-art trackers.
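
The chapter body is paywalled in this preview, so only the abstract's description of the architecture is available. As a loose illustration of what cross-attention fusion between two feature granularities can look like, here is a minimal PyTorch sketch; the module name, dimensions, and layer choices are assumptions of ours, not the authors' MGTT implementation:

```python
import torch
import torch.nn as nn

class CrossGranularityFusion(nn.Module):
    """Hypothetical sketch: fuse a fine-granularity and a coarse-granularity
    token stream via cross-attention (not the authors' MGTT code)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Each branch queries the other branch's tokens.
        self.fine_attends_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_attends_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_fine = nn.LayerNorm(dim)
        self.norm_coarse = nn.LayerNorm(dim)

    def forward(self, fine, coarse):
        # fine:   (B, N_fine, dim)   tokens from a small-patch branch
        # coarse: (B, N_coarse, dim) tokens from a large-patch branch
        f2c, _ = self.fine_attends_coarse(query=fine, key=coarse, value=coarse)
        c2f, _ = self.coarse_attends_fine(query=coarse, key=fine, value=fine)
        fine = self.norm_fine(fine + f2c)        # residual connection + norm
        coarse = self.norm_coarse(coarse + c2f)
        return fine, coarse

# Toy usage: batch of 2, with 64 fine tokens and 16 coarse tokens of width 256.
fusion = CrossGranularityFusion()
f, c = fusion(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
```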



Acknowledgements

This work is supported in part by the National Key Research and Development Plan under Grant No. 2022YFB3104700 and by the National Natural Science Foundation of China under Grants No. 61976158, No. 62376198, No. 62076182, No. 62163016, and No. 62006172. It is also partially supported by the Jiangxi "Double Thousand Plan" and by the Jiangxi Provincial Natural Science Foundation under Grant No. 20212ACB202001.

Author information

Corresponding author

Correspondence to Ziye Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Z., Miao, D. (2023). Multi-granularity Feature Fusion for Transformer-Based Single Object Tracking. In: Campagner, A., Urs Lenz, O., Xia, S., Ślęzak, D., Wąs, J., Yao, J. (eds) Rough Sets. IJCRS 2023. Lecture Notes in Computer Science, vol. 14481. Springer, Cham. https://doi.org/10.1007/978-3-031-50959-9_22

  • DOI: https://doi.org/10.1007/978-3-031-50959-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-50958-2

  • Online ISBN: 978-3-031-50959-9

  • eBook Packages: Computer Science, Computer Science (R0)
