Abstract
Object tracking with deep networks has recently achieved substantial improvements in tracking performance. In this paper, we propose a multitask Siamese neural network that uses a residual hierarchical attention mechanism to achieve high-performance object tracking. The network is trained offline in an end-to-end manner and is capable of real-time tracking. To produce more efficient and generalizable attention-aware features, we propose residual hierarchical attention learning, which uses residual skip connections in the attention module to obtain hierarchical attention. Moreover, we formulate a multitask correlation filter layer that exploits the missing link between context awareness and regression target adaptation, and we insert this differentiable layer into the network to improve its discriminative capability. Experimental analyses conducted on the OTB, VOT and TColor-128 datasets, which cover a variety of tracking scenarios, demonstrate the effectiveness of the proposed real-time object-tracking network.
Acknowledgments
This work is supported by the National High-tech Research and Development Program 863, China [2015AA042307]; the National Key Research and Development Program, China [2018YFB1305803]; the Joint Fund of National Natural Science Foundation and Shandong Province, China [U1706228]; the National Natural Science Foundation of China [61673245]; the Program for Outstanding PhD Candidate of Shandong University; the Natural Sciences and Engineering Research Council of Canada; the National Natural Science Foundation of China [61572300; 81871508; 61773246]; the Taishan Scholar Program of Shandong Province, China [TSHW201502038]; and the Major Program of Shandong Province Natural Science Foundation, China [ZR2018ZB0419].
Appendix
Using the properties of convex functions and circulant matrices, a closed-form solution for the filter parameter w in the proposed objective function can be derived. This appendix details the derivation.
The objective function in (5) can be rewritten as follows using the circulant matrix \(\mathbf{B}=\left[\begin{array}{ccccc}\mathbf{X}_{0} & \sqrt{\theta_{3}}\mathbf{X}_{1} & \sqrt{\theta_{3}}\mathbf{X}_{2} & \ldots & \sqrt{\theta_{3}}\mathbf{X}_{k}\end{array}\right]^{T}\):
We define \(\mathbf{z}=\left[\begin{array}{c}\mathbf{w}\\ \mathbf{y}^{\prime}\end{array}\right]\) and then search for the \(\mathbf{z}\) that optimizes (10):
Because our proposed objective function is convex, (11) can be optimized by setting its first derivative to zero:
Since \(\mathbf{B}\) is a circulant matrix, we can exploit its diagonalization properties as follows:
where F is the DFT matrix.
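The circulant property invoked here can be verified numerically. The following sketch (using a generic circulant matrix built from a single base vector, not the paper's exact \(\mathbf{B}\)) confirms that any circulant matrix is diagonalized by the DFT matrix F, which is what lets the matrix products in the derivation collapse to element-wise operations in the Fourier domain:

```python
import numpy as np

# Base vector; each column of the circulant matrix is a cyclic shift of it.
x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)
B = np.stack([np.roll(x, j) for j in range(n)], axis=1)

F = np.fft.fft(np.eye(n)) / np.sqrt(n)  # unitary DFT matrix
x_hat = np.fft.fft(x)                   # (unnormalized) DFT of the base vector

# Circulant diagonalization: B = F^H diag(x_hat) F.
B_rebuilt = F.conj().T @ np.diag(x_hat) @ F

print(np.allclose(B, B_rebuilt))  # True
```

Because the eigenvalues of the circulant matrix are exactly the DFT coefficients of its base vector, every inverse and product of such matrices reduces to element-wise division and multiplication of DFT vectors, which is what makes the closed-form solution below cheap to evaluate.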
By substituting (13) into the result of (12), the equation can be transformed into (14) and (15):
where \(\mathbf{N}=\operatorname{diag}(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i}+\theta_{1})\) and \(\mathbf{V}=\operatorname{diag}(1+\theta_{2})\).
Thus, \(\left[\begin{array}{c}\hat{\mathbf{w}}^{*}\\ \hat{\mathbf{y}^{\prime}}^{*}\end{array}\right]\) can be rewritten as follows:
In (16), \(\left[\begin{array}{cc}\mathbf{N} & \mathbf{C}\\ \mathbf{D} & \mathbf{V}\end{array}\right]^{-1}\) is given as follows:
Then, the solution for \(\hat{\mathbf{w}}^{*}\) follows from (16) and (17):
where \((\mathbf{N}-\mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}\) is:
Finally, we obtain the solution for the filter parameter \(\mathbf{w}\):
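As a sanity check on this style of Fourier-domain closed form, a minimal single-sample ridge-regression analogue can be verified numerically. This is not the paper's full multitask objective: it drops the context samples and the target-adaptation variable, and a single regularizer `lam` stands in for \(\theta_{1}\). The circulant structure makes the Fourier-domain solution match a direct linear solve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 64, 0.1
x = rng.standard_normal(n)  # base feature sample
# Gaussian-shaped regression target centered on the sample.
y = np.exp(-0.5 * ((np.arange(n) - n // 2) ** 2) / 4.0)

# Data matrix whose columns are all cyclic shifts of x (a circulant matrix).
X = np.stack([np.roll(x, j) for j in range(n)], axis=1)

# Direct ridge-regression solution: w = (X^T X + lam*I)^(-1) X^T y.
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier-domain closed form: the circulant structure diagonalizes the
# problem, so the solve reduces to element-wise operations on DFT vectors.
x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
w_fourier = np.real(np.fft.ifft(
    np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + lam)))

print(np.allclose(w_direct, w_fourier))  # True
```

The Fourier version costs O(n log n) instead of O(n^3), which is the efficiency argument behind embedding a correlation filter layer in a real-time tracker.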
Cite this article
Huang, W., Gu, J., Ma, X. et al. End-to-end multitask Siamese network with residual hierarchical attention for real-time object tracking. Appl Intell 50, 1908–1921 (2020). https://doi.org/10.1007/s10489-019-01605-2