Abstract
Visual tracking can be formulated as regressing the conditional probability of the target location in each video frame. Convolutional neural networks (CNNs) have dominated visual tracking in recent years, but CNN-based trackers neglect long-range dependencies in both the likelihood representation and the prior information, which destroys the spatial consistency of the target. Recently emerging Transformer-based trackers mitigate these issues; however, they cannot build interactions among cross-scale features. Moreover, the sine positional encoding prior in Transformer-based trackers is content-unaware and fails to reflect the relative indices of different positions. To address these issues, and inspired by the Bayesian probabilistic formulation, we propose a cross-scale full Transformer tracker with a content-based prior bias (named BTT). The method makes four main contributions: (i) we propose a hierarchical full Transformer tracking architecture that introduces long-range dependencies, enriching the likelihood representation of the model and alleviating the destruction of spatial consistency; (ii) we propose an expanding layer, free of convolution and interpolation operations, to aggregate layer information at different scales and construct a cross-scale likelihood estimation; (iii) we demonstrate the defect of sine positional encoding through mathematical derivation and introduce a content-based positional encoding bias as a prior in the Transformer architecture to reflect the relative indices of the inputs; (iv) extensive experiments show that the proposed tracker outperforms CNN-based trackers under illumination variation, low resolution, and deformation on various datasets, and achieves superior performance on the other attributes. The proposed tracker obtains 70.3%, 69.1%, and 63.4% on OTB2015, UAV123, and LaSOT, respectively.
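As a rough illustration of the distinction the abstract draws between a content-unaware sinusoidal positional prior and a content-based positional bias, the sketch below implements the standard sine/cosine encoding and a toy attention step whose logits receive an additive bias computed from the features themselves. The function names and the particular bias construction (a simple key-similarity term) are illustrative assumptions for this sketch, not the paper's actual BTT formulation.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    """Standard sine/cosine positional encoding: depends only on the
    index, so it is identical for any input content (content-unaware)."""
    pos = np.arange(num_positions)[:, None]            # (P, 1)
    i = np.arange(dim // 2)[None, :]                   # (1, D/2)
    angles = pos / np.power(10000.0, 2 * i / dim)      # (P, D/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def attention_with_content_bias(q, k, v, bias):
    """Scaled dot-product attention with an additive positional bias on
    the logits; `bias` has shape (P, P) and stands in for a
    content-dependent relative-position term."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# The sinusoidal encoding is fixed by position alone: two different
# inputs of the same length receive exactly the same encoding.
pe = sinusoidal_encoding(8, 16)
assert np.allclose(pe, sinusoidal_encoding(8, 16))

# A content-based bias, by contrast, is computed from the features
# themselves (here, key-key similarity as a simple stand-in).
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
content_bias = (k @ k.T) / 16.0
out = attention_with_content_bias(q, k, v, content_bias)
print(out.shape)  # (8, 16)
```

The toy example only makes the content-unaware/content-aware contrast concrete; how BTT actually parameterizes and learns its content-based prior bias is developed in the paper itself.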
Chu He, Yan Huang and Kehan Chen contributed equally to this work.
Fan, S., Chen, X., He, C. et al. Cross-scale content-based full Transformer network with Bayesian inference for object tracking. Multimed Tools Appl 82, 19877–19900 (2023). https://doi.org/10.1007/s11042-022-14162-7