Convolutional neural net bagging for online visual tracking

https://doi.org/10.1016/j.cviu.2016.07.002

Highlights

  • The proposed CNN bagging method is simple yet effective.

  • It addresses the label noise and model uncertainty problems simultaneously for CNN-based trackers.

  • State-of-the-art performance on three recent benchmarks, i.e., CVPR2013, VOT2013, and TB-50, demonstrates the validity of the proposed algorithm.

Abstract

Recently, Convolutional Neural Nets (CNNs) have been successfully applied to online visual tracking. A major problem, however, is that such models are prone to over-fitting, for two main reasons. The first is label noise: online training relies solely on the detections made in previous frames. The second is model uncertainty, which stems from the randomized training strategy. In this work, we cope with noisy labels and model uncertainty within the framework of bagging (bootstrap aggregating), resulting in efficient and effective visual tracking. Instead of using multiple models in a bag, we design a single multitask CNN for learning effective feature representations of the target object. In our model, each task has the same structure and shares the same set of convolutional features, but is trained using different random samples generated for different tasks. A significant advantage is that the bagging overhead of our model is minimal, and no extra effort is needed to combine the outputs of different tasks, as is required in multi-lifespan models. Experiments demonstrate that our CNN tracker outperforms state-of-the-art methods on three recent benchmarks (over 80 video sequences), which illustrates the superiority of the feature representations learned by our purely online bagging framework.

Introduction

Tracking-by-detection approaches have prevailed in recent years. These methods usually rely on predefined heuristics to construct a set of positive (object) and negative (background) samples around the estimated object location. These samples typically carry binary labels, which leads to a few positive samples and a large negative training set.

Since Convolutional Neural Networks (CNNs) have been successfully adopted for object detection, it is not surprising to witness a surge of deep learning methods for visual tracking (Fan et al., 2010; Wang and Yeung, 2013). However, although online training has proven hugely beneficial (Babenko et al., 2009; Gao et al., 2014; Hare et al., 2011), the direct adoption of CNNs for online visual tracking is not straightforward.

To begin with, CNNs require a large number of training samples, which are often not available in visual tracking. Moreover, a CNN tends to overfit to the most recent observations, e.g., the most recent instances dominating the model, which may cause the drift problem. Besides, CNN training is computationally intensive for online visual tracking; the slow updating speed could prevent the CNN model from being practical. Li et al. (2015a, 2015b) handled these problems by using either an ensemble of CNNs or a single CNN with different sampling methods for the positive and negative classes. Most recently, an enhanced version of these trackers (Li et al., 2015a, 2015b) achieves high tracking accuracy by exploiting color information and label uncertainties.

The underlying reason why online tracking is challenging is that the object locations, except in the first frame, are not always reliable: they are estimated by the visual tracker itself, so uncertainty is unavoidable (Babenko et al., 2009; Zhang et al., 2016). One can treat this difficulty as a label noise problem (Lawrence and Schölkopf, 2001; Long and Servedio, 2010; Natarajan et al., 2013). Furthermore, in the deep learning literature, the highly non-convex loss function of a CNN is usually optimized in a stochastic fashion (Krizhevsky et al., 2012; Nair and Hinton, 2010). As a result, local optima are almost inevitable in the training procedure. For offline training tasks, this difficulty can be alleviated by running a large number of training epochs over large training sets (Krizhevsky et al., 2012). In visual tracking, in contrast, the time budget is highly constrained and only hundreds of training samples are available for each frame. Given different initial parameters or different training data, the stochastic gradient descent (SGD) method can easily lead to totally different CNN models. The label noise and the model uncertainty can exacerbate each other and consequently cause serious object drifting. To cope with the model uncertainty, Li et al. (2015b) propose to cache the CNN models over time in a CNN pool and select the best one in the test phase. However, this requires an extra feature-matching process at test time, and the multiple CNN models also slow down the tracking speed significantly. On the other hand, Li et al. (2015a) employ a multiple-lifespan sampling strategy to handle the noisy labels in tracking. Li et al. (2015a) obtained higher accuracy and efficiency than Li et al. (2015b), but their tracker performs unstably as it relies on merely a single CNN model. In this work, we propose to solve the above two problems in one framework, i.e., CNN bagging.
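To see how this model uncertainty arises, consider the toy sketch below (a hypothetical Python illustration, not code from the paper): plain gradient descent on a simple non-convex loss converges to different local optima from two nearly identical initializations, just as SGD with different seeds or minibatch orders can yield entirely different CNN models.

```python
def grad(w):
    # Gradient of the toy non-convex loss (w^2 - 4)^2 / 8,
    # which has two local minima, at w = -2 and w = +2.
    return w * (w * w - 4.0) / 2.0

def descend(w0, lr=0.05, steps=300):
    """Plain gradient descent from the initial parameter w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Two nearly identical starting points end up in different optima;
# minibatch noise (as in SGD) makes such divergence even easier.
print(descend(+0.1))  # converges to about +2
print(descend(-0.1))  # converges to about -2
```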

Bagging has several appealing characteristics. For example, bagging is more sensible than methods based on multiple lifespans (Xing et al., 2013), because it requires no additional information to combine the detection results of multiple lifespans, and it need not cope with the dilemma between long-term and short-term memory. However, bagging usually incurs a significant computational load, because every individual model in the bag needs to be updated simultaneously.
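For contrast, here is a minimal sketch of classic bagging with a toy logistic-regression base learner (a hypothetical NumPy stand-in for the CNNs in the paper; the data, learner, and bag size are illustrative only). Each of the K models is fitted on its own bootstrap resample, so the full training cost is paid K times, which is exactly the overhead noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.1, steps=200):
    """Fit a tiny logistic-regression base model by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def bag_predict(models, X):
    """Aggregate the bag by averaging the models' probabilistic outputs."""
    probs = np.mean([1.0 / (1.0 + np.exp(-X @ w)) for w in models], axis=0)
    return (probs > 0.5).astype(int)

# Toy data standing in for (feature, label) pairs harvested from tracked
# frames; the labels are deliberately noisy, as tracker-estimated labels are.
X = rng.normal(size=(200, 5))
y = ((X[:, 0] + 0.3 * rng.normal(size=200)) > 0).astype(int)

# Classic bagging: K independent models, each trained from scratch on its
# own bootstrap resample (sampling with replacement).
K = 5
models = [fit_logreg(X[idx], y[idx])
          for idx in (rng.integers(0, len(X), size=len(X)) for _ in range(K))]

print("bagged training accuracy:", (bag_predict(models, X) == y).mean())
```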

Instead of multiple CNNs, we propose a single multitask CNN for learning effective feature representations of the target object. In our model, all tasks share the same set of features, and each task is trained using a different set of random samples. Each task generates scores for all possible hypotheses of the object location in a given frame, and the object prediction is obtained by a simple soft-max operation over the scores in the current frame.
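To make the design concrete, here is a minimal PyTorch sketch of such a shared-trunk, multi-head network; the layer sizes, the number of tasks, and the aggregation rule are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BaggingCNN(nn.Module):
    """A single CNN that mimics a bag: one shared convolutional trunk
    plus K lightweight task heads, each trained on its own random sample."""

    def __init__(self, num_tasks=4, patch_size=32):
        super().__init__()
        self.trunk = nn.Sequential(  # convolutional features shared by all tasks
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * (patch_size // 4) ** 2
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        feats = self.trunk(x)  # computed once per patch, reused by every head
        return torch.cat([h(feats) for h in self.heads], dim=1)  # (B, K) scores

model = BaggingCNN()
patches = torch.randn(8, 3, 32, 32)      # 8 candidate patches from one frame
scores = model(patches)                  # one score per (patch, task) pair
# Average the task scores, then soft-max over candidates to pick the target.
weights = torch.softmax(scores.mean(dim=1), dim=0)
print("selected candidate:", weights.argmax().item())
```

Because the trunk is evaluated once per patch and each head is only a thin linear layer, adding tasks is almost free, which is what keeps the bagging overhead minimal.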

Our experiments on three recent benchmarks involving over 80 videos demonstrate that our method outperforms all the compared state-of-the-art algorithms and rarely loses track of the object. In addition, it achieves a practical tracking speed (from 3 fps to 6 fps, depending on the sequence and setting), which is comparable to state-of-the-art visual trackers. Our main contributions include:

  • We propose to use CNN bagging to cope with noisy labels and model uncertainty simultaneously in online visual tracking.

  • We design a single multitask CNN that implements CNN bagging effectively.

  • We achieve the best results reported in the literature, at speeds of up to 6 fps.

Section snippets

Related work

Image features play a crucial role in many challenging computer vision tasks, such as object recognition and detection. Unfortunately, in many online visual trackers the features are manually defined and combined (Adam et al., 2006; Collins et al., 2005; Hare et al., 2011; Pérez et al., 2002). Even though these methods report satisfactory results on individual datasets, hand-crafted feature representations limit the performance of tracking. This

Our approach

We first introduce the basic ideas and notation for online visual tracking using a CNN; then we propose a multitask CNN framework as the solution to a bag of CNN models. We further describe our sampling procedure and our loss function.

Benchmarks and experiment setting

We evaluated our method on three recently proposed tracking benchmarks: the CVPR2013 Visual Tracker Benchmark (Wu et al., 2013), the VOT2013 Challenge Benchmark (Kristan et al., 2013), and the TB-50 benchmark (Wu et al., 2015). Most parameters of the CNN tracker are given in Sec. 3.2. In addition, there are some motion parameters for sampling the image patches. In this work, we only consider the displacement Δx, Δy and the relative scale s = h/32 of the object, where h is the object's height.

Given a
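To make the sampling concrete, the hypothetical NumPy sketch below draws candidate object states around the previous estimate using the (Δx, Δy, s) parameterization above, with s = h/32; the Gaussian widths and the number of candidates are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidates(cx, cy, h, n=100, sigma_xy=8.0, sigma_s=0.05):
    """Draw n candidate states around the previous estimate (cx, cy, h).

    A state is (center x, center y, height); the relative scale is
    s = h / 32 as in the text, jittered log-normally."""
    dx = rng.normal(0.0, sigma_xy, size=n)          # displacement in x
    dy = rng.normal(0.0, sigma_xy, size=n)          # displacement in y
    s = (h / 32.0) * np.exp(rng.normal(0.0, sigma_s, size=n))  # relative scale
    return np.stack([cx + dx, cy + dy, 32.0 * s], axis=1)

candidates = sample_candidates(cx=120.0, cy=80.0, h=64.0)
print(candidates[:3])  # a few (cx, cy, height) candidate triplets
```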

Conclusion

This paper proposed a single multitask CNN, used as a bagging model, for coping with noisy labels in visual tracking. We developed an efficient training method for the multitask CNN, resulting in minimal training overhead. Together with a new loss function and an improved sampling process, our CNN tracker outperformed state-of-the-art methods on three recently proposed benchmarks (over 80 video sequences), which demonstrates the superiority of our purely online bagging framework.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 61462042 and Grant 61672079, and in part by the Australian Research Council's Discovery Projects Funding Scheme under Project DP150104645.


References (44)

  • A. Adam et al., Robust fragments-based tracking using the integral histogram, CVPR, 2006.

  • B. Babenko et al., Visual tracking with online multiple instance learning, CVPR, 2009.

  • Y. Bengio et al., Representation learning: a review and new perspectives, Trans. PAMI, 2013.

  • R. Caruana, Multitask learning, Machine Learning, 1997.

  • D.C. Ciresan et al., Multi-column deep neural networks for image classification, CVPR, 2012.

  • R.T. Collins et al., Online selection of discriminative tracking features, Trans. PAMI, 2005.

  • N. Dalal et al., Histograms of oriented gradients for human detection, CVPR, Vol. 1, 2005.

  • T.B. Dinh et al., Context tracker: exploring supporters and distracters in unconstrained environments, CVPR, 2011.

  • M. Everingham et al., The PASCAL visual object classes (VOC) challenge, IJCV, 2010.

  • J. Fan et al., Human tracking using convolutional neural networks, Trans. Neur. Netw., 2010.

  • M. Felsberg, Enhanced distribution field tracking using channel representations, ICCV Workshops (ICCVW), 2013.

  • J. Gao et al., Transfer learning based visual tracking with Gaussian processes regression, ECCV, 2014.

  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, 2014.

  • S. Hare et al., Struck: structured output tracking with kernels, ICCV, 2011.

  • J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters. ...

  • S. Hong et al., Online tracking by learning discriminative saliency map with convolutional neural network.

  • X. Jia et al., Visual tracking via adaptive structural local sparse appearance model, CVPR, 2012.

  • Y. Jia et al., Caffe: convolutional architecture for fast feature embedding, Proceedings of the ACM International Conference on Multimedia, 2014.

  • Z. Kalal et al., P-N learning: bootstrapping binary classifiers by structural constraints, CVPR, 2010.

  • M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, Čehovin, ...

  • M. Kristan et al., The visual object tracking VOT2013 challenge results, ICCV Workshops (ICCVW), 2013.

  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, NIPS, 2012.

Hanxi Li is a Special Term Professor in the School of Computer and Information Engineering, Jiangxi Normal University, China. He was a Researcher at NICTA (Australia) from 2011 to 2015. He received his PhD from the Research School of Information Science and Engineering at the Australian National University, Canberra, Australia. His recent areas of interest include visual tracking, face recognition, and deep learning.

Yi Li is a Senior Research Scientist at the Toyota Research Institute in Ann Arbor, Michigan, and an Adjunct Fellow at the Australian National University, Canberra, Australia. He was a Senior Researcher at NICTA (Australia) from 2011 to 2015. He received his PhD from the ECE Dept. at the University of Maryland, College Park. His PhD research, entitled "Cognitive Robots for Social Intelligence", focused on visual navigation for mobile robots, optical motion capture, causal inference for coordinated groups, and action recognition and representation. He was a Future Faculty Fellow at Maryland from 2008 to 2010, received the Best Student Paper award at ICHFR, and won second prize in the Semantic Robot Vision Challenge (SRVC). He co-organized the first two DeepVision workshops at the IEEE Conference on Computer Vision and Pattern Recognition, and served as Area Chair of the IEEE Winter Conference on Applications of Computer Vision in 2015 and 2016. His recent areas of interest include autonomous driving, robotics, and deep learning.

Fatih Porikli is an IEEE Fellow. He received the Ph.D. degree from New York University, NY. He is currently a Professor in the Research School of Engineering, Australian National University (ANU). He also manages the Computer Vision Research Group at Data61. Until 2013, he was a Distinguished Research Scientist with Mitsubishi Electric Research Labs (MERL), Cambridge, USA. His research interests include computer vision, pattern recognition, manifold learning, sparse optimization, online learning, and image enhancement, with commercial applications in video surveillance, intelligent transportation, satellite, and medical systems. He has authored more than 140 publications and invented 66 patents. He received the R&D100 2006 Award in the Scientist of the Year category, in addition to four IEEE Best Paper Awards and five professional prizes. He serves as an Associate Editor of the IEEE Signal Processing Magazine, the SIAM Journal on Imaging Sciences, Real-Time Image and Video Processing (Springer), and the EURASIP Journal on Image and Video Processing. He was the General Chair of the IEEE Winter Conference on Applications of Computer Vision in 2014 and of the IEEE Advanced Video and Signal Based Surveillance Conference in 2010, and serves on the Organizing Committees of many IEEE events.
