Convolutional neural net bagging for online visual tracking
Introduction
Tracking-by-detection approaches have prevailed in recent years. These methods usually rely on predefined heuristics to construct a set of positive (object) and negative (background) samples from the estimated object location. Often these samples carry binary labels, which leads to a few positive samples and a large negative training set.
Since Convolutional Neural Networks (CNNs) have been successfully adopted for object detection, it is not surprising to witness a surge of deep learning methods for visual tracking (Fan et al., 2010; Wang and Yeung, 2013). However, although online training has proven highly beneficial (Babenko et al., 2009; Gao et al., 2014; Hare et al., 2011), the direct adoption of CNNs for online visual tracking is not straightforward.
To begin with, CNNs require a large number of training samples, which is often not available in visual tracking. Moreover, a CNN tends to overfit to the most recent observations, e.g., the most recent instances dominating the model, which may cause drift. Besides, CNN training is computationally intensive for online visual tracking; the slow update speed can prevent a CNN model from being practical. Li et al. (2015a, 2015b) handled these problems by using either an ensemble of CNNs or a single CNN with different sampling methods for the positive and negative classes. Most recently, an enhanced version of these trackers achieves high tracking accuracy by exploiting color information and label uncertainties.
The underlying reason why online tracking is challenging is that the object locations, except in the first frame, are not always reliable: they are estimated by the tracker itself, so uncertainty is unavoidable (Babenko et al., 2009; Zhang et al., 2016). One can treat this difficulty as a label noise problem (Lawrence and Schölkopf, 2001; Long and Servedio, 2010; Natarajan et al., 2013). Furthermore, in the deep learning literature, the highly non-convex loss function of a CNN is usually optimized in a stochastic fashion (Krizhevsky et al., 2012; Nair and Hinton, 2010). As a result, local optima are almost inevitable in the training procedure. For offline training tasks, this difficulty can be alleviated by running many training epochs over large training sets (Krizhevsky et al., 2012). In visual tracking, in contrast, the time budget is highly constrained and only hundreds of training samples are available per frame. Given different initial parameters or different training data, stochastic gradient descent (SGD) can easily produce totally different CNN models. The label noise and the model uncertainty can thus reinforce each other and cause serious object drift. To cope with the model uncertainty, Li et al. (2015b) propose to cache the CNN models over time in a CNN pool and select the best one in the test phase. However, this requires an extra feature matching process at test time, and maintaining multiple CNN models also slows the tracker significantly. On the other hand, Li et al. (2015a) employ a multiple-lifespan sampling strategy to handle the noisy labels in tracking; they obtain higher accuracy and efficiency than Li et al. (2015b), but the tracker performs unstably as it relies on merely a single CNN model. In this work, we propose to solve both problems in one framework: CNN bagging.
Bagging has several appealing properties. For example, it is more sensible than methods based on multiple lifespans (Xing et al., 2013), because it requires no additional information to combine the detections from multiple lifespans, and it does not need to resolve the dilemma between long-term and short-term memory. However, bagging usually incurs a significant computational load, because every individual model needs to be updated simultaneously.
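The benefit of classic bagging under noisy labels can be illustrated with a toy sketch (ours, not the paper's implementation): several SGD-trained classifiers, each fit on a different bootstrap resample of label-noisy data, are averaged at test time. The data, model sizes, and hyperparameters below are all illustrative stand-ins for CNN training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: positives around +1, negatives around -1 (a hypothetical
# stand-in for features of object vs. background patches).
X = np.vstack([rng.normal(1.0, 1.0, (100, 2)), rng.normal(-1.0, 1.0, (100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Simulate label noise: flip 15% of the training labels.
noisy = y.copy()
flip = rng.random(200) < 0.15
noisy[flip] = 1 - noisy[flip]

def train_logistic(X, y, epochs=100, lr=0.1):
    """Plain SGD logistic regression (a cheap stand-in for one CNN)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))
            g = p - y[i]
            w -= lr * g * X[i]
            b -= lr * g
    return w, b

# Bagging: each model sees a different bootstrap resample of the noisy data.
models = []
for _ in range(5):
    sel = rng.integers(0, len(X), len(X))
    models.append(train_logistic(X[sel], noisy[sel]))

def bagged_score(x):
    # Averaging the individual scores smooths out both the label noise
    # and the run-to-run randomness of SGD.
    return np.mean([1.0 / (1.0 + np.exp(-(x @ w + b))) for w, b in models])

acc = np.mean([(bagged_score(x) > 0.5) == bool(t) for x, t in zip(X, y)])
print(f"bagged accuracy on clean labels: {acc:.2f}")
```

As the text notes, the drawback is cost: every member of the ensemble must be updated at every frame, which motivates the single-network formulation below.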
Instead of multiple CNNs, we propose a single multitask CNN for learning effective feature representations of the target object. In our model, all tasks share the same set of features, and each task is trained on a different set of random samples. Each task generates scores for all hypotheses of the object location in a given frame, and the prediction is obtained by a simple soft-max operation over the scores in the current frame.
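The structure of this idea can be sketched as follows. This is a minimal numpy illustration of shared features plus per-task heads trained on different random subsets, not the paper's network or training procedure; the random-projection "trunk", ridge-regression heads, and all sizes are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the shared convolutional trunk: a fixed random
# projection mapping a raw patch vector (16-D) to a shared feature vector (8-D).
W_shared = rng.normal(size=(16, 8))

def shared_features(patches):
    return np.tanh(patches @ W_shared)

# Training patches: the first 50 are object-like, the last 50 background-like.
pos = rng.normal(0.5, 0.3, (50, 16))
neg = rng.normal(-0.5, 0.3, (50, 16))
patches = np.vstack([pos, neg])
labels = np.concatenate([np.ones(50), np.zeros(50)])

# K task heads on top of the shared features; each head is trained on its
# own random subset of the samples (the "bagging" part, inside one model).
K = 4
heads = []
for _ in range(K):
    sel = rng.choice(len(patches), size=60, replace=False)
    F = shared_features(patches[sel])
    # Ridge-regression head: a cheap stand-in for SGD on a logistic head.
    w = np.linalg.solve(F.T @ F + 0.1 * np.eye(8), F.T @ labels[sel])
    heads.append(w)

def score_candidates(candidates):
    """Average the K head scores for each candidate location."""
    F = shared_features(candidates)
    return np.mean([F @ w for w in heads], axis=0)

# At test time the tracked object is the highest-scoring candidate.
cands = np.vstack([rng.normal(-0.5, 0.3, (9, 16)),   # background-like
                   rng.normal(0.5, 0.3, (1, 16))])   # object-like (index 9)
best = int(np.argmax(score_candidates(cands)))
print("predicted object candidate:", best)
```

Because the trunk is shared, one forward pass serves all K heads, so the ensemble adds only a few extra inner products per candidate.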
Our experiments on three recent benchmarks involving over 80 videos demonstrate that our method outperforms all the compared state-of-the-art algorithms and rarely loses track of the objects. In addition, it achieves a practical tracking speed (from 3 fps to 6 fps, depending on the sequence and setting), which is comparable to state-of-the-art visual trackers. Our main contributions include:
- We proposed to use CNN bagging for coping with noisy labels and model uncertainty simultaneously in online visual tracking.
- We designed a single multitask CNN that implements CNN bagging effectively.
- We achieved the best reported results in the literature at speeds of up to 6 fps.
Section snippets
Related work
Image features play a crucial role in many challenging computer vision tasks such as object recognition and detection. Unfortunately, in many online visual trackers the features are manually defined and combined (Adam et al., 2006; Collins et al., 2005; Hare et al., 2011; Pérez et al., 2002). Even though these methods report satisfactory results on individual datasets, hand-crafted feature representations can limit tracking performance. This
Our approach
We first introduce the basic ideas and notation for online visual tracking using a CNN; we then propose a multitask CNN framework as a solution for learning a bag of CNN models. We further describe our sampling procedure and our loss function.
Benchmarks and experiment setting
We evaluated our method on three recently proposed tracking benchmarks: the CVPR2013 Visual Tracker Benchmark (Wu et al., 2013), the VOT2013 Challenge Benchmark (Kristan et al., 2013), and the TB-50 benchmark (Wu et al., 2015). Most parameters of the CNN tracker are given in Sec. 3.2. In addition, there are some motion parameters for sampling the image patches. In this work, we only consider the displacement (Δx, Δy) and the relative scale of the object, both measured relative to the object's height h.
Given a
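The motion sampling described above can be sketched as follows. This is a minimal illustration; the function name and the sigma values are ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_candidates(box, n=200, sigma_xy=0.3, sigma_s=0.05):
    """Sample candidate boxes around the previous estimate.

    `box` is (cx, cy, w, h). The displacement (dx, dy) is drawn relative
    to the object height h, and the scale is perturbed multiplicatively.
    """
    cx, cy, w, h = box
    dx = rng.normal(0.0, sigma_xy * h, n)
    dy = rng.normal(0.0, sigma_xy * h, n)
    s = np.exp(rng.normal(0.0, sigma_s, n))   # log-normal scale factor, always > 0
    return np.stack([cx + dx, cy + dy, w * s, h * s], axis=1)

cands = sample_candidates((100.0, 80.0, 40.0, 60.0))
print(cands.shape)   # (200, 4)
```

Each sampled box is then cropped, resized to the network's input size, and scored; the soft-max over the scores picks the new object location.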
Conclusion
This paper proposed a single multitask CNN that acts as a bag of models for coping with noisy labels in visual tracking. We developed an efficient training method for the multitask CNN, resulting in minimal training overhead. Together with a new loss function and an improved sampling process, our CNN tracker outperformed state-of-the-art methods on three recently proposed benchmarks (over 80 video sequences), which demonstrates the superiority of our purely online bagging framework.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 61462042 and Grant 61672079, and in part by the Australian Research Council's Discovery Projects Funding Scheme under Project DP150104645.
Hanxi Li is a Special Term Professor in the School of Computer and Information Engineering, Jiangxi Normal University, China. He was a Researcher at NICTA (Australia) from 2011-2015. He received his PhD from the Research School of Information Science and Engineering at the Australian National University, Canberra, Australia. His recent areas of interest include visual tracking, face recognition, and deep learning.
References (44)

- Robust fragments-based tracking using the integral histogram. CVPR (2006)
- Visual tracking with online multiple instance learning. CVPR (2009)
- Representation learning: a review and new perspectives. IEEE Trans. PAMI (2013)
- Multitask learning. Machine Learning (1997)
- Multi-column deep neural networks for image classification. CVPR (2012)
- Online selection of discriminative tracking features. IEEE Trans. PAMI (2005)
- Histograms of oriented gradients for human detection. CVPR (2005)
- Context tracker: exploring supporters and distracters in unconstrained environments. CVPR (2011)
- The PASCAL Visual Object Classes (VOC) challenge. IJCV (2010)
- Human tracking using convolutional neural networks. IEEE Trans. Neural Networks (2010)
- Enhanced distribution field tracking using channel representations. ICCV Workshops (ICCVW)
- Transfer learning based visual tracking with Gaussian processes regression. ECCV (2014)
- Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR
- Struck: structured output tracking with kernels. ICCV (2011)
- Online tracking by learning discriminative saliency map with convolutional neural network
- Visual tracking via adaptive structural local sparse appearance model. CVPR
- Caffe: convolutional architecture for fast feature embedding. Proceedings of the ACM International Conference on Multimedia
- P-N learning: bootstrapping binary classifiers by structural constraints. CVPR
- The Visual Object Tracking VOT2013 challenge results. ICCV Workshops (ICCVW) (2013)
- ImageNet classification with deep convolutional neural networks. NIPS (2012)
Yi Li is a Senior Research Scientist at the Toyota Research Institute in Ann Arbor, Michigan, and an Adjunct Fellow at the Australian National University, Canberra, Australia. He was a Senior Researcher at NICTA (Australia) from 2011 to 2015. He received his PhD from the ECE Department at the University of Maryland, College Park. His PhD research, entitled "Cognitive Robots for Social Intelligence", focused on visual navigation for mobile robots, optical motion capture, causal inference for coordinated groups, and action recognition and representation. He was a Future Faculty Fellow at Maryland from 2008 to 2010, received the Best Student Paper award at ICHFR, and won the second prize in the Semantic Robot Vision Challenge (SRVC). He co-organized the first two DeepVision workshops at the IEEE Conference on Computer Vision and Pattern Recognition, and served as Area Chair of the IEEE Winter Conference on Applications of Computer Vision in 2015 and 2016. His recent areas of interest include autonomous driving, robotics, and deep learning.
Fatih Porikli is an IEEE Fellow. He received the Ph.D. degree from New York University, NY. He is currently a Professor in the Research School of Engineering, Australian National University (ANU), where he also manages the Computer Vision Research Group at Data61. Until 2013, he was a Distinguished Research Scientist with Mitsubishi Electric Research Labs (MERL), Cambridge, USA. His research interests include computer vision, pattern recognition, manifold learning, sparse optimization, online learning, and image enhancement, with commercial applications in video surveillance, intelligent transportation, satellite, and medical systems. He has authored more than 140 publications and invented 66 patents. He received the 2006 R&D100 Award in the Scientist of the Year category, in addition to four IEEE Best Paper Awards and five professional prizes. He serves as an Associate Editor of the IEEE Signal Processing Magazine, the SIAM Journal on Imaging Sciences, Real-Time Image and Video Processing (Springer), and the EURASIP Journal on Image and Video Processing. He was the General Chair of the IEEE Winter Conference on Applications of Computer Vision in 2014 and the IEEE Advanced Video and Signal Based Surveillance Conference in 2010, and has served on the organizing committees of many IEEE events.