
End-to-end multitask Siamese network with residual hierarchical attention for real-time object tracking

Published in Applied Intelligence.

Abstract

Object tracking with deep networks has recently achieved substantial improvement in terms of tracking performance. In this paper, we propose a multitask Siamese neural network that uses a residual hierarchical attention mechanism to achieve high-performance object tracking. This network is trained offline in an end-to-end manner, and it is capable of real-time tracking. To produce more efficient and generative attention-aware features, we propose residual hierarchical attention learning using residual skip connections in the attention module to receive hierarchical attention. Moreover, we formulate a multitask correlation filter layer to exploit the missing link between context awareness and regression target adaptation, and we insert this differentiable layer into a neural network to improve the discriminatory capability of the network. The results of experimental analyses conducted on the OTB, VOT and TColor-128 datasets, which contain various tracking scenarios, demonstrate the efficiency of our proposed real-time object-tracking network.



Acknowledgments

This work was supported by the National High-tech Research and Development Program 863, China [2015AA042307]; the National Key Research and Development Program, China [2018YFB1305803]; the Joint Fund of National Natural Science Foundation and Shandong Province, China [U1706228]; the National Natural Science Foundation of China [61673245]; the Program for Outstanding PhD Candidate of Shandong University; the Natural Sciences and Engineering Research Council of Canada; the National Natural Science Foundation of China [61572300; 81871508; 61773246]; the Taishan Scholar Program of Shandong Province, China [TSHW201502038]; and the Major Program of Shandong Province Natural Science Foundation, China [ZR2018ZB0419].

Author information


Corresponding authors

Correspondence to Jason Gu or Xin Ma.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Using the convexity of the objective function and the properties of circulant matrices, a closed-form solution for the filter parameter w in the proposed objective function can be derived. This appendix presents the detailed derivation.

The objective function in (5) can be rewritten as follows using the circulant matrix \(\mathbf{B}= \left[\begin{array}{ccccc} \mathbf{X}_{0} & \sqrt{\theta_{3}}\mathbf{X}_{1} & \sqrt{\theta_{3}}\mathbf{X}_{2} & \ldots & \sqrt{\theta_{3}}\mathbf{X}_{k} \end{array}\right]^{T}\):

$$ \begin{array}{rcl} \arg\underset{\mathbf{w},\mathbf{y}}{\min} \mathbb{O} &=& \left\| \left[\begin{array}{c} \mathbf{X}_{0}\\ \sqrt{\theta_{3}}\mathbf{X}_{1}\\ \vdots\\ \sqrt{\theta_{3}}\mathbf{X}_{k} \end{array}\right] \mathbf{w}- \left[\begin{array}{c} \mathbf{y}\\ 0\\ \vdots\\ 0 \end{array}\right] \right\|^{2}_{2}+\theta_{1}\|\mathbf{w}\|^{2}_{2} +\theta_{2} \left\| \left[\begin{array}{c} \mathbf{y}\\ 0\\ \vdots\\ 0 \end{array}\right] -\left[\begin{array}{c} \mathbf{y}_{0}\\ 0\\ \vdots\\ 0 \end{array}\right] \right\|^{2}_{2}\\ &=&\|\mathbf{B}\mathbf{w}-\mathbf{y}^{\prime}\|^{2}_{2}+\theta_{1}\|\mathbf{w}\|^{2}_{2}+\theta_{2}\|\mathbf{y}^{\prime}-\mathbf{y}^{\prime}_{0}\|^{2}_{2}. \end{array} $$
(10)

We define \(\mathbf{z}=\left[\begin{array}{c}\mathbf{w}\\ \mathbf{y}^{\prime} \end{array}\right]\) and then seek the z that minimizes (10):

$$ \arg\underset{\mathbf{z}}{\min}\, \mathbb{O}(\mathbf{z}) = \left\|\left[\begin{array}{cc}\mathbf{B} & -\mathbf{I} \end{array}\right]\mathbf{z}\right\|^{2}_{2}+\theta_{1}\left\|\left[\begin{array}{cc}\mathbf{I} & 0 \end{array}\right]\mathbf{z}\right\|^{2}_{2} +\theta_{2}\left\|\left[\begin{array}{cc}0 & \mathbf{I} \end{array}\right]\mathbf{z}-\mathbf{y}^{\prime}_{0}\right\|^{2}_{2}. $$
(11)

Because the proposed objective function is convex, (11) can be minimized by setting its first derivative to zero:

$$ \begin{array}{rcl} \nabla_{\mathbf{z}}\mathbb{O}(\mathbf{z}) &=& \left[\begin{array}{cc}\mathbf{B}^{T}\mathbf{B} & -\mathbf{B}^{T}\\ -\mathbf{B} & \mathbf{I} \end{array}\right]\mathbf{z}+\theta_{1}\left[\begin{array}{cc}\mathbf{I} & 0\\ 0 & 0 \end{array}\right]\mathbf{z} +\theta_{2}\left[\begin{array}{cc}0 & 0\\ 0 & \mathbf{I} \end{array}\right]\mathbf{z} -\theta_{2}\left[\begin{array}{c}0\\ \mathbf{I} \end{array}\right]\mathbf{y}^{\prime}_{0}=0\\ &\Rightarrow&\left[\begin{array}{cc}\mathbf{B}^{T}\mathbf{B}+\theta_{1}\mathbf{I} & -\mathbf{B}^{T}\\ -\mathbf{B} & (1+\theta_{2})\mathbf{I} \end{array}\right]\mathbf{z}=\theta_{2}\left[\begin{array}{c}0\\ \mathbf{I} \end{array}\right]\mathbf{y}^{\prime}_{0}. \end{array} $$
(12)
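As a sanity check, the stationarity system in (12) can be verified numerically: solve the linear system and confirm that the gradient of the objective vanishes at the solution. The sketch below uses a generic random matrix as a stand-in for the stacked data matrix B and illustrative values for θ1 and θ2 (all concrete values are assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 12, 5
th1, th2 = 0.3, 0.7                      # theta_1, theta_2 (illustrative)
B = rng.standard_normal((m, n))          # stand-in for the stacked data matrix
y0p = rng.standard_normal(m)             # y'_0

# Assemble and solve the stationarity system (12)
A = np.block([[B.T @ B + th1 * np.eye(n), -B.T],
              [-B, (1 + th2) * np.eye(m)]])
rhs = th2 * np.concatenate([np.zeros(n), y0p])
z = np.linalg.solve(A, rhs)
w, yp = z[:n], z[n:]

# Gradient of O(w, y') = ||Bw - y'||^2 + th1||w||^2 + th2||y' - y'_0||^2
grad_w = 2 * B.T @ (B @ w - yp) + 2 * th1 * w
grad_yp = -2 * (B @ w - yp) + 2 * th2 * (yp - y0p)
assert np.allclose(grad_w, 0) and np.allclose(grad_yp, 0)
```

The two gradient blocks are exactly the two block rows of (12), so a zero gradient confirms the system was assembled correctly.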

B is built from circulant blocks, so each block can be diagonalized by the DFT matrix as follows:

$$ \begin{array}{rcl} \mathbf{B}&=&\mathbf{F}\left[\begin{array}{cccc}diag(\hat{\mathbf{x}}_{0}) & \sqrt{\theta_{3}}diag(\hat{\mathbf{x}}_{1}) & \ldots & \sqrt{\theta_{3}}diag(\hat{\mathbf{x}}_{k}) \end{array}\right]^{T}\mathbf{F}^{H}=-\mathbf{F}\mathbf{D}\mathbf{F}^{H},\\ \mathbf{B}^{T}&=&\mathbf{F}\left[\begin{array}{cccc}diag(\hat{\mathbf{x}}^{*}_{0}) & \sqrt{\theta_{3}}diag(\hat{\mathbf{x}}^{*}_{1}) & \ldots & \sqrt{\theta_{3}}diag(\hat{\mathbf{x}}^{*}_{k}) \end{array}\right]\mathbf{F}^{H} =-\mathbf{F}\mathbf{C}\mathbf{F}^{H}, \end{array} $$
(13)

where F is the DFT matrix.
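The diagonalization in (13) rests on the standard fact that a circulant matrix is diagonalized by the DFT matrix. A minimal numerical check, assuming the common convention in which row i of the circulant data matrix is the base vector x cyclically shifted by i positions (the convention itself is an assumption, not stated in the paper):

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Circulant data matrix: row i is x cyclically shifted by i, X[i, j] = x[(j - i) % n]
X = np.array([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix: U[j, k] = exp(-2*pi*1j*j*k/n) / sqrt(n)
idx = np.arange(n)
U = np.exp(-2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)

# Diagonalization: X = U diag(fft(x)) U^H (real up to round-off for real x)
X_rebuilt = (U @ np.diag(np.fft.fft(x)) @ U.conj().T).real
assert np.allclose(X, X_rebuilt)
```

Because every block of B has this form, multiplying by F and F^H on either side reduces all the block products in the derivation to elementwise operations on the DFT coefficients.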

Substituting (13) into (12) transforms the system into (14) and (15):

$$ \left[\begin{array}{cc} \mathbf{F} & 0\\ 0 & \mathbf{F} \end{array}\right] \left[\begin{array}{cc} diag(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i}+\theta_{1}) & \mathbf{C}\\ \mathbf{D} & diag(1+\theta_{2}) \end{array}\right] \left[\begin{array}{cc} \mathbf{F}^{H} & 0\\ 0 & \mathbf{F}^{H} \end{array}\right] \mathbf{z}=\theta_{2} \left[\begin{array}{c}0\\ \mathbf{I} \end{array}\right]\mathbf{y}^{\prime}_{0}, $$
(14)
$$ \left[\begin{array}{cc} \mathbf{N} & \mathbf{C}\\ \mathbf{D} & \mathbf{V} \end{array}\right] \left[\begin{array}{c} \hat{\mathbf{w}}^{*}\\ \hat{\mathbf{y}^{\prime}}^{*} \end{array}\right] =\theta_{2} \left[\begin{array}{c}0\\ \mathbf{F}^{H} \end{array}\right]\mathbf{y}^{\prime}_{0}, $$
(15)

where \(\mathbf{N}=diag(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i}+\theta_{1})\) and \(\mathbf{V}=diag(1+\theta_{2})\).

Thus, \(\left[\begin{array}{c} \hat{\mathbf{w}}^{*}\\ \hat{\mathbf{y}^{\prime}}^{*} \end{array}\right]\) can be rewritten as follows:

$$ \left[\begin{array}{c} \hat{\mathbf{w}}^{*}\\ \hat{\mathbf{y}^{\prime}}^{*} \end{array}\right] =\theta_{2} \left[\begin{array}{cc} \mathbf{N} & \mathbf{C}\\ \mathbf{D} & \mathbf{V} \end{array}\right]^{-1} \left[\begin{array}{c}0\\ \mathbf{F}^{H} \end{array}\right]\mathbf{y}^{\prime}_{0}. $$
(16)

In (16), the block inverse \(\left[\begin{array}{cc} \mathbf{N} & \mathbf{C}\\ \mathbf{D} & \mathbf{V} \end{array}\right]^{-1}\) is given by:

$$ \left[\begin{array}{cc} \mathbf{N} & \mathbf{C}\\ \mathbf{D} & \mathbf{V} \end{array}\right]^{-1} = \left[\begin{array}{cc} (\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1} & -(\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}\mathbf{C}\mathbf{V}^{-1}\\ -\mathbf{V}^{-1}\mathbf{D}(\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1} & \mathbf{V}^{-1}\mathbf{D}(\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}\mathbf{C}\mathbf{V}^{-1}+\mathbf{V}^{-1} \end{array}\right]. $$
(17)
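Equation (17) is the standard block-matrix inversion identity built on the Schur complement of V. It can be checked numerically with arbitrary blocks, as long as V and the Schur complement are invertible; the block names below mirror (17), but their values are random stand-ins rather than the quantities from the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
# Random stand-ins for the blocks; a shifted N keeps everything well conditioned
N = rng.standard_normal((p, p)) + 6 * np.eye(p)
C = rng.standard_normal((p, p))
D = rng.standard_normal((p, p))
V = 2.5 * np.eye(p)                     # V = diag(1 + theta_2) is a scaled identity

M = np.block([[N, C], [D, V]])
Vi = np.linalg.inv(V)
S = np.linalg.inv(N - C @ Vi @ D)       # inverse Schur complement of V

# Block inverse per (17)
Mi = np.block([[S, -S @ C @ Vi],
               [-Vi @ D @ S, Vi @ D @ S @ C @ Vi + Vi]])
assert np.allclose(Mi, np.linalg.inv(M))
```

Only the top-left and top-right blocks are needed for ŵ*, which is why the remaining steps track just the first block row.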

The solution for \(\hat{\mathbf{w}}^{*}\) then follows from (16) and (17):

$$ \hat{\mathbf{w}}^{*}=-\theta_{2}(\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}\mathbf{C}\mathbf{V}^{-1}\mathbf{F}^{H}\mathbf{y}^{\prime}_{0}, $$
(18)

where \((\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}\) is:

$$ \begin{array}{rcl} (\mathbf{N}- \mathbf{C}\mathbf{V}^{-1}\mathbf{D})^{-1}&=&\left[diag(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i}+\theta_{1}) -\frac{diag(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i})}{diag(1+\theta_{2})}\right]^{-1}\\ &=&diag\left(\frac{1 + \theta_{2}}{\theta_{2}(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0} + \theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i})+\theta_{1}(1 + \theta_{2})}\right). \end{array} $$
(19)

Finally, we obtain the solution for the filter parameter w:

$$ \begin{array}{rcl} \hat{\mathbf{w}}^{*}&=&\frac{\theta_{2}(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{y}}^{*}_{0})}{\theta_{2}(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i})+\theta_{1}(1+\theta_{2})}\\ &\Rightarrow&\hat{\mathbf{w}}=\frac{\theta_{2}(\hat{\mathbf{x}}_{0}\odot\hat{\mathbf{y}}_{0})}{\theta_{2}(\hat{\mathbf{x}}^{*}_{0}\odot\hat{\mathbf{x}}_{0}+\theta_{3}{\sum}^{k}_{i=1}\hat{\mathbf{x}}^{*}_{i}\odot\hat{\mathbf{x}}_{i})+\theta_{1}(1+\theta_{2})}. \end{array} $$
(20)
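The closed form in (20) can be cross-checked against a direct linear-algebra solve of the joint objective. The sketch below builds B from circulant blocks (rows as cyclic shifts, the same assumed convention as above), eliminates y′ from the stationarity conditions to get a spatial-domain linear system in w, and compares the result with the inverse FFT of (20). The sizes n, k and the values of θ1–θ3 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 16, 3
th1, th2, th3 = 0.1, 0.5, 0.2           # theta_1..theta_3 (illustrative)

def circulant(v):
    # Rows are cyclic shifts of v, so the block is DFT-diagonalizable
    return np.array([np.roll(v, i) for i in range(len(v))])

x0 = rng.standard_normal(n)             # target sample
xs = rng.standard_normal((k, n))        # k context samples
y0 = rng.standard_normal(n)             # regression target y_0

# Spatial-domain solve: eliminating y' from the stationarity conditions gives
# [th2 B^T B + th1 (1 + th2) I] w = th2 B^T y'_0
B = np.vstack([circulant(x0)] + [np.sqrt(th3) * circulant(x) for x in xs])
w_direct = np.linalg.solve(th2 * B.T @ B + th1 * (1 + th2) * np.eye(n),
                           th2 * B.T @ np.concatenate([y0, np.zeros(k * n)]))

# Frequency-domain closed form, eq. (20)
x0h, y0h = np.fft.fft(x0), np.fft.fft(y0)
denom = th2 * (np.abs(x0h) ** 2 + th3 * (np.abs(np.fft.fft(xs, axis=1)) ** 2).sum(0)) \
        + th1 * (1 + th2)
w_freq = np.fft.ifft(th2 * x0h * y0h / denom).real

assert np.allclose(w_direct, w_freq)
```

The agreement shows why the filter can be computed with elementwise FFT operations instead of inverting an n×n system, which is what makes the correlation filter layer cheap enough for real-time tracking.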


Cite this article

Huang, W., Gu, J., Ma, X. et al. End-to-end multitask Siamese network with residual hierarchical attention for real-time object tracking. Appl Intell 50, 1908–1921 (2020). https://doi.org/10.1007/s10489-019-01605-2

