Abstract
Recent years have witnessed the popularity of convolutional neural networks (CNNs) in a variety of computer vision tasks, including video object tracking. Existing CNN-based tracking methods output either a scalar score or a confidence map, which makes it infeasible to estimate the target's scale and rotation angle accurately: like traditional methods, they assume the target's aspect ratio and rotation angle are fixed. To address this limitation, we propose to use a binary mask as the CNN's output for tracking. To this end, we adapt a semantic segmentation model by online fine-tuning on augmented samples from the initial frame, so that it can segment the target in subsequent frames. When generating training samples, we employ a crop-and-paste method to better exploit context information, add a random offset to the lightness component to mimic illumination changes, and apply Gaussian filtering to mimic blur. During tracking, owing to the CNN's limited receptive field and spatial resolution, the network may fail to identify the target if the estimated bounding box is considerably incorrect. We therefore propose a bounding box approximation method that exploits temporal consistency. Excluding the initial training cost, our tracker runs at 41 FPS on a single GeForce 1080Ti GPU. Evaluated on the OTB-2015, VOT-2016 and TempleColor benchmarks, it achieves results comparable to non-real-time top trackers and state-of-the-art performance among real-time ones.
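The augmentation and mask-to-box ideas summarized above can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function names, the lightness-jitter range, and the Gaussian kernel parameters are all assumptions for the sake of example.

```python
import numpy as np

def jitter_lightness(hls_img, max_delta=30):
    """Mimic illumination change: add one random offset to the L channel
    of an HLS image (jitter range is an illustrative assumption)."""
    delta = np.random.uniform(-max_delta, max_delta)
    out = hls_img.astype(np.float32)
    out[..., 1] = np.clip(out[..., 1] + delta, 0, 255)
    return out.astype(np.uint8)

def gaussian_blur(img, sigma=1.5):
    """Mimic blur with a separable Gaussian filter on a 2-D image."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    # Convolve each column, then each row, with the 1-D kernel.
    out = np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 0, img.astype(np.float32))
    out = np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 1, out)
    return out

def bbox_from_mask(mask):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a binary mask,
    or None if the mask is empty (the paper's temporally consistent
    approximation would refine such a box further)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

For example, `bbox_from_mask` on a mask whose foreground occupies rows 2–4 and columns 3–6 returns `(3, 2, 6, 4)`; a rotated box could be fit analogously from the same pixel coordinates.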
Acknowledgments
This work was supported in part by NSFC under Contract 61822208 and Contract 61632019, in part by the Fundamental Research Funds for the Central Universities, in part by the Young Elite Scientists Sponsorship Program by the CAST under Grant 2016QNRC001, and in part by the Youth Innovation Promotion Association CAS under Grant 2018497.
Huang, J., Zhou, W., Tian, Q. et al. Exploiting weak mask representation with convolutional neural networks for accurate object tracking. Multimed Tools Appl 78, 20961–20985 (2019). https://doi.org/10.1007/s11042-019-7219-y