
Exploiting weak mask representation with convolutional neural networks for accurate object tracking


Abstract

Recent years have witnessed the popularity of convolutional neural networks (CNNs) in a variety of computer vision tasks, including video object tracking. Existing CNN-based tracking methods employ either a scalar score or a confidence map as the network's output, which makes it infeasible to estimate the target's accurate scale and rotation angle. Specifically, as with other traditional methods, they assume the target's aspect ratio and rotation angle are fixed. To address this limitation, we propose to take a binary mask as the output of the CNN for tracking. To this end, we adapt a semantic segmentation model by online fine-tuning with augmented samples from the initial frame, so that it can uncover the target in the following frames. When generating training samples, we employ a Crop-and-Paste method to better exploit context information, add a random value to the lightness component to mimic illumination change, and apply Gaussian filtering to mimic blur. During tracking, owing to the CNN's limited receptive field size and spatial resolution, the network may fail to identify the target if the estimated bounding box is considerably incorrect. We therefore propose a bounding box approximation method that enforces temporal consistency. Excluding the initial training cost, our tracker runs at 41 FPS on a single GeForce 1080Ti GPU. Evaluated on the OTB-2015, VOT-2016 and TempleColor benchmarks, it achieves results comparable to top non-real-time trackers and state-of-the-art performance among real-time ones.
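To make the sample-augmentation and box-approximation steps concrete, the sketch below shows one plausible realization in Python with OpenCV and NumPy. It is not the authors' implementation: the function names, the HLS-based lightness jitter, the blur range, and the exponential box smoothing used for temporal consistency are all illustrative assumptions.

# A minimal sketch (not the authors' code) of the augmentation and
# box-approximation ideas in the abstract. All names, parameter values,
# and the smoothing scheme are illustrative assumptions.
import cv2
import numpy as np

def jitter_lightness(image, max_shift=30):
    # Mimic illumination change: add one random offset to the lightness
    # (L) channel of the HLS representation, then convert back to BGR.
    hls = cv2.cvtColor(image, cv2.COLOR_BGR2HLS).astype(np.int16)
    shift = np.random.randint(-max_shift, max_shift + 1)
    hls[:, :, 1] = np.clip(hls[:, :, 1] + shift, 0, 255)
    return cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2BGR)

def gaussian_blur(image, max_sigma=2.0):
    # Mimic blur with a Gaussian filter of random strength; ksize=(0, 0)
    # lets OpenCV derive the kernel size from sigma.
    sigma = np.random.uniform(0.5, max_sigma)
    return cv2.GaussianBlur(image, (0, 0), sigmaX=sigma)

def crop_and_paste(frame, target_mask, background):
    # Cut the masked target out of the initial frame and paste it onto a
    # different background crop (same size as the frame), varying the
    # context the segmentation network sees during online fine-tuning.
    out = background.copy()
    fg = target_mask.astype(bool)
    out[fg] = frame[fg]
    return out

def mask_to_box(mask, prev_box, alpha=0.7):
    # Approximate an axis-aligned box from the predicted binary mask and
    # blend it with the previous frame's box for temporal consistency.
    # (The paper also estimates scale and rotation; this axis-aligned
    # version is a simplification.) If the mask is empty, i.e. the
    # network lost the target, fall back to the previous box.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return np.asarray(prev_box, dtype=float)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=float)
    return alpha * box + (1.0 - alpha) * np.asarray(prev_box, dtype=float)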




Acknowledgments

This work was supported in part by NSFC under Contract 61822208 and Contract 61632019, in part by the Fundamental Research Funds for the Central Universities, in part by the Young Elite Scientists Sponsorship Program by the CAST under Grant 2016QNRC001, and in part by the Youth Innovation Promotion Association CAS under Grant 2018497.

Author information


Correspondence to Wengang Zhou or Houqiang Li.



Cite this article

Huang, J., Zhou, W., Tian, Q. et al. Exploiting weak mask representation with convolutional neural networks for accurate object tracking. Multimed Tools Appl 78, 20961–20985 (2019). https://doi.org/10.1007/s11042-019-7219-y
