Abstract
Recent years have witnessed the popularity of convolutional neural networks (CNNs) in a variety of computer vision tasks, including video object tracking. Existing CNN-based tracking methods output either a scalar score or a confidence map, which makes it infeasible to estimate the target's scale and rotation angle accurately: like traditional methods, they assume the target's aspect ratio and rotation angle are fixed. To address this limitation, we propose to use a binary mask as the CNN's output for tracking. To this end, we adapt a semantic segmentation model by online fine-tuning on augmented samples from the initial frame, so that it can segment the target in subsequent frames. When generating training samples, we employ a crop-and-paste method to better exploit context information, add a random offset to the lightness component to mimic illumination changes, and apply Gaussian filtering to mimic blur. During tracking, owing to the CNN's limited receptive field and spatial resolution, the network may fail to identify the target if the estimated bounding box is considerably incorrect. We therefore propose a bounding box approximation method that exploits temporal consistency. Excluding the initial training cost, our tracker runs at 41 FPS on a single GeForce 1080Ti GPU. Evaluated on the OTB-2015, VOT-2016 and TempleColor benchmarks, it achieves results comparable to non-real-time top trackers and state-of-the-art performance among real-time ones.
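The augmentation and mask-to-box ideas summarized above can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function names, the lightness-jitter range, and the Gaussian kernel parameters are all assumptions for the sake of example.

```python
import numpy as np

def jitter_lightness(hls_img, max_delta=30):
    """Mimic illumination change: add one random offset to the L channel
    of an HLS image (jitter range is an illustrative assumption)."""
    delta = np.random.uniform(-max_delta, max_delta)
    out = hls_img.astype(np.float32)
    out[..., 1] = np.clip(out[..., 1] + delta, 0, 255)
    return out.astype(np.uint8)

def gaussian_blur(img, sigma=1.5):
    """Mimic blur with a separable Gaussian filter on a 2-D image."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    # Convolve each column, then each row, with the 1-D kernel.
    out = np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 0, img.astype(np.float32))
    out = np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 1, out)
    return out

def bbox_from_mask(mask):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a binary mask,
    or None if the mask is empty (the paper's temporally consistent
    approximation would refine such a box further)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

For example, `bbox_from_mask` on a mask whose foreground occupies rows 2–4 and columns 3–6 returns `(3, 2, 6, 4)`; a rotated box could be fit analogously from the same pixel coordinates.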
Acknowledgments
This work was supported in part by NSFC under Contract 61822208 and Contract 61632019, in part by the Fundamental Research Funds for the Central Universities, in part by the Young Elite Scientists Sponsorship Program by the CAST under Grant 2016QNRC001, and in part by the Youth Innovation Promotion Association CAS under Grant 2018497.
Huang, J., Zhou, W., Tian, Q. et al. Exploiting weak mask representation with convolutional neural networks for accurate object tracking. Multimed Tools Appl 78, 20961–20985 (2019). https://doi.org/10.1007/s11042-019-7219-y