Neurocomputing, Volume 349, 15 July 2019, Pages 133-144

Learning transform-aware attentive network for object tracking

https://doi.org/10.1016/j.neucom.2019.02.021

Abstract

Existing trackers often decompose the task of visual tracking into multiple independent components, such as target appearance sampling, classifier learning, and target state inference. In this paper, we present a transform-aware attentive tracking framework, which uses a deep attentive network to directly predict target states via spatial transform parameters. During off-line training, the proposed network learns generic motion patterns of target objects from auxiliary large-scale videos. These learned motion patterns are then applied to track target objects on test sequences. Built on the Spatial Transformer Network (STN), the proposed attentive network is fully differentiable and can be trained in an end-to-end manner. Notably, we only fine-tune the pre-trained network in the initial frame. The proposed tracker requires neither online model update nor appearance sampling during the tracking process. Extensive experiments on the OTB-2013, OTB-2015, VOT-2014 and UAV-123 datasets demonstrate the competitive performance of our method against state-of-the-art attentive tracking methods.

Introduction

Object tracking is a fundamental problem in computer vision with a wide range of applications, including human-machine interaction, video surveillance, traffic monitoring, etc. Typically, given the ground truth of the target object in the first frame, object tracking aims at predicting the target states, e.g., position and scale, in subsequent frames. Recent years have witnessed the success of the tracking-by-detection approach, which incrementally learns a binary classifier to discriminate target objects from the background. This approach requires generating a large number of samples in each frame using sliding windows [1], [2], [3], random sampling [4], [5], or region proposals [6], [7]. For training the discriminative classifier, the samples are assigned binary labels according to their overlap ratios with respect to the tracked result in the previous frame. During tracking, the classifier computes the confidence scores of the samples, and the sample with the highest confidence score indicates the tracked result. Note that independently computing the confidence scores of samples often causes a heavy computational burden, which is even heavier for deep learning trackers. For example, the speed of the recently proposed MDNet [3] tracker is less than one frame per second. To avoid drawing samples, an alternative approach is to learn correlation filters [8], [9]. The output correlation response maps can be used to locate target objects precisely. However, such response maps are hardly aware of scale changes. We also note that correlation filters heavily rely on an incremental update scheme, which is performed frame by frame on the fly. Slight inaccuracies in a single frame easily accumulate and degrade the learned correlation filters.
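
As a concrete illustration of this sampling-based pipeline (and not part of the proposed method), the following Python sketch labels candidate boxes by their overlap with the previous tracked result and selects the most confident sample; the box format and thresholds are assumptions made for the example.

# Illustrative sketch of tracking by detection: label candidate boxes by
# overlap with the previous result, then pick the highest-confidence one.
# Box format [x, y, w, h] and the thresholds below are assumptions.
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def label_samples(samples, prev_box, pos_thr=0.7, neg_thr=0.3):
    """Assign binary labels from overlap with the previous tracked box."""
    overlaps = np.array([iou(s, prev_box) for s in samples])
    labels = np.full(len(samples), -1)      # -1: ignored during training
    labels[overlaps >= pos_thr] = 1         # positive samples
    labels[overlaps <= neg_thr] = 0         # negative samples
    return labels

def track_step(samples, classifier_scores):
    """The sample with the highest classifier confidence is the new result."""
    return samples[int(np.argmax(classifier_scores))]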

In this work, instead of drawing a large number of samples to learn a discriminative classifier or directly learning correlation filters, we exploit a novel framework to infer target states in terms of both position and scale changes in an end-to-end manner (see Fig. 1). We take inspiration from the recent success of the spatial transformer [10] as well as the visual attention mechanism in learning deep neural networks. On the one hand, Spatial Transformer Networks (STN) learn invariance to translation, scale, rotation and more generic warping. Therefore, an STN can attend to task-relevant regions via predicted transformation parameters, and it is straightforward to exploit this invariance to estimate the appearance changes of target objects. On the other hand, existing attentive tracking methods built on deep neural networks such as the Restricted Boltzmann Machine (RBM) [11] and the Recurrent Neural Network (RNN) [12] cannot deal with spatial transformations. Therefore, multiple independent components are needed for position and scale estimation. In other words, the visual attention mechanism in [11], [12] is only exploited as one submodule for estimating location changes. This work aims at learning a unified attention network that directly predicts both the position and scale changes via spatial transformer parameters.
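
To make the role of the spatial transformer concrete, the following minimal PyTorch-style sketch shows how predicted affine parameters warp an input into an attended region. It illustrates the generic STN operation [10] only, under assumed tensor shapes; the authors' own implementation is in MATLAB/Caffe.

# Minimal sketch of the spatial transformer operation: given predicted affine
# parameters theta, crop/scale/translate the input into an attended region.
import torch
import torch.nn.functional as F

def spatial_transform(feature_map, theta, out_size=(64, 64)):
    """
    feature_map: (N, C, H, W) tensor.
    theta:       (N, 6) affine parameters from a localization network.
    Returns the attended region resampled to out_size.
    """
    n, c, _, _ = feature_map.shape
    theta = theta.view(n, 2, 3)                      # 2x3 affine matrix per sample
    grid = F.affine_grid(theta, (n, c, *out_size), align_corners=False)
    return F.grid_sample(feature_map, grid, align_corners=False)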

The proposed transform-aware attentive network (TAAT) is a Siamese matching network with two input branches. We constantly feed the ground truth of the target in the first frame into one branch, while sequentially feeding image frames into the other branch. Each branch consists of multiple convolutional layers that generate deep features. The features from the two branches are then concatenated and fed into fully connected layers that output spatial transformer parameters. The proposed network naturally attends to regions of interest where the target object is likely to be. Compared to traditional attentive tracking methods, the proposed network outputs a considerably finer attentive area defined by the spatial transformer parameters. This naturally makes the visual tracking algorithm more invariant to translation and scale changes. We first train the proposed TAAT network off-line in an end-to-end manner on a large labeled video dataset. We use a data augmentation scheme in both the temporal and spatial domains. In each iteration, we feed triplets, i.e., a reference image, a search image, and the ground-truth image of the target object, into the network. We use an ℓ1 loss constraint to speed up convergence. During the tracking process, we apply this pre-trained network to search frames. The output directly indicates the moving state of the target as well as a glimpse [13] of the input image. Fig. 2 illustrates an overview of the proposed tracker.
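
The following PyTorch-style sketch summarizes this two-branch design: shared convolutional branches, concatenated features, fully connected layers regressing the transform parameters, and an ℓ1 training loss. The layer sizes and the helper names (TAATSketch, l1_step) are illustrative assumptions, not the paper's exact configuration.

# Hedged sketch of a Siamese matching network that regresses spatial
# transformer parameters from a reference/search image pair.
import torch
import torch.nn as nn

class TAATSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared convolutional feature extractor (one branch of the Siamese net).
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, 5, stride=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # Fully connected layers output the spatial transformer parameters.
        self.regressor = nn.Sequential(
            nn.Linear(2 * feat_dim * 6 * 6, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 6),                     # 2x3 affine transform
        )

    def forward(self, reference, search):
        f_ref = self.features(reference).flatten(1)
        f_sea = self.features(search).flatten(1)
        return self.regressor(torch.cat([f_ref, f_sea], dim=1))

# One training step with the L1 loss mentioned in the text, applied to a
# (reference, search, ground-truth transform) triplet.
def l1_step(model, reference, search, theta_gt):
    theta_pred = model(reference, search)
    return nn.functional.l1_loss(theta_pred, theta_gt)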

We summarize the contributions of this work as follows:

  • We propose a transform-aware attentive network for object tracking by integrating the attention mechanism into a tailored Spatial Transformer Network. The proposed network attends to the region of interest with finer attention and can be trained in an end-to-end manner. With the use of an ℓ1 loss constraint, the proposed network converges quickly during training.

  • We cast the visual tracking problem as pairwise matching and thereby dispense with the cumbersome sampling scheme. The proposed algorithm achieves a satisfactory tracking speed.

  • Extensive experiments on popular benchmark datasets demonstrate the favorable performance of the proposed algorithm when compared with state-of-the-art trackers.

The rest of this paper is organized as follows. In Section 2, we review the works closely related to our proposed approach. Section 3 gives a detailed description of the proposed transform-aware attentive model. Experimental results are reported and analyzed in Section 4. We conclude this paper in Section 5.


Related work

Visual tracking has long been an active research area, and deep learning has become increasingly popular for visual tracking. We briefly categorize the most closely related works into the following aspects: (1) tracking by sampling target states in images, (2) tracking by inferring target states from response maps, and (3) tracking by attention models.

Transform-aware attentive tracking

In this section, we first give an overview of the proposed transform-aware attentive network. We then present the network architecture in more detail and introduce the training scheme on large-scale datasets. Lastly, we show how to use the trained model to perform visual tracking.

Implementation

The proposed model is implemented in MATLAB with the Caffe library [56]. The model is trained on an Intel Xeon 1.60 GHz CPU with 16 GB RAM and a TITAN X GPU. We utilize AlexNet [49], VGGNet [30] and ResNet [57] to build the ConvNet part of the localization network, respectively. Specifically, we leverage the deep features from the Conv5 layer of AlexNet, the Conv5_3 layer of VGGNet, and the res4f layer of ResNet for target appearance representation. For ResNet, we add a 1 × 1 convolution layer which reduces the feature …
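
As a rough illustration of this backbone setup (using torchvision rather than the authors' MATLAB/Caffe code), the sketch below truncates a ResNet-50 after the conv4 stage, which roughly corresponds to the res4f block mentioned above, and appends a 1 × 1 convolution to reduce the feature channels. The reduced channel count of 256 is an assumption for the example.

# Sketch: deep features from a truncated ResNet-50, followed by a 1x1
# convolution that shrinks the channel dimension. Requires torchvision >= 0.13
# (older versions use the pretrained= argument instead of weights=).
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet50(weights=None)
# Keep layers up to and including layer3 (the conv4_x stage, whose last block
# corresponds to res4f in the Caffe naming).
backbone = nn.Sequential(*list(resnet.children())[:-3])
reduce_dim = nn.Conv2d(1024, 256, kernel_size=1)   # 1x1 conv to shrink channels

x = torch.randn(1, 3, 224, 224)
feat = reduce_dim(backbone(x))                     # shape: (1, 256, 14, 14)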

Conclusion

In this paper, we propose a transform-aware attentive tracking method inspired by deep attentive networks. With the use of a revised Spatial Transformer Network, the proposed network attends to regions of interest where the target object is likely to be. The output spatial transform parameters indicate the target states with both location and scale information. The proposed algorithm does not require the cumbersome state sampling and model updating that existing tracking algorithms do. It is …

References (65)

  • D. Huang et al., Enable scale and aspect ratio adaptability in visual tracking with detection proposals, Proceedings of the British Machine Vision Conference (2015)
  • G. Zhu et al., Beyond local search: tracking objects everywhere with instance-specific proposals, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • C. Ma et al., Hierarchical convolutional features for visual tracking, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • M. Jaderberg et al., Spatial transformer networks, Proceedings of Advances in Neural Information Processing Systems (2015)
  • M. Denil et al., Learning where to attend with deep architectures for image tracking, Neural Comput. (2012)
  • S.E. Kahou et al., RATM: recurrent attentive tracking model, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017)
  • K. Gregor et al., DRAW: a recurrent neural network for image generation, Proceedings of the International Conference on Machine Learning (2015)
  • Z. Kalal et al., Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • D. Held et al., Learning to track at 100 FPS with deep regression networks, Proceedings of the European Conference on Computer Vision (2016)
  • B. Ma et al., Discriminative tracking using tensor pooling, IEEE Trans. Cybern. (2016)
  • Q. Guo et al., Structure-regularized compressive tracking with online data-driven sampling, IEEE Trans. Image Process. (2017)
  • C. Li et al., Visual tracking via dynamic graph learning, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • X. Wang et al., SINT++: robust visual tracking via adversarial positive instance generation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • R. Tao et al., Siamese instance search for tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • J. Shen et al., Submodular trajectories for better motion segmentation in videos, IEEE Trans. Image Process. (2018)
  • B. Ni et al., Progressively parsing interactional objects for fine grained action detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • X. Lu et al., Deep regression tracking with shrinkage loss, Proceedings of the European Conference on Computer Vision (2018)
  • T. Zhang et al., Correlation particle filter for visual tracking, IEEE Trans. Image Process. (2018)
  • L. Wang et al., Visual tracking with fully convolutional networks, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • L. Wang et al., STCT: sequentially training convolutional networks for visual tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556

    Xiankai Lu received the B.S. degree in automation from Shandong University, Jinan, China, in 2012. He is currently pursuing his Ph.D. degree in Shanghai Jiao Tong University, Shanghai, China. His research interests include image processing, object tracking and deep learning.

    Bingbing Ni received a B.Eng. in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2005, and a Ph.D. from the National University of Singapore, Singapore, in 2011. He is currently a Professor with the Department of Electrical Engineering, Shanghai Jiao Tong University. Before that, he was a Research Scientist with the Advanced Digital Sciences Center, Singapore. He was with Microsoft Research Asia, Beijing, China, as a Research Intern in 2009. He was also a Software Engineer Intern with Google Inc., Mountain View, CA, USA, in 2010. Dr. Ni was a recipient of the Best Paper Award from PCM11 and the Best Student Paper Award from PREMIA08. He was also the recipient of the first prize in the International Contest on Human Activity Recognition and Localization in conjunction with the International Conference on Pattern Recognition in 2012.

    Chao Ma is a senior research associate with the Australian Centre for Robotic Vision at The University of Adelaide. He received a Ph.D. from Shanghai Jiao Tong University in 2016. His research interests include computer vision and machine learning. He was sponsored by China Scholarship Council as a visiting Ph.D. student at the University of California at Merced from the fall of 2013 to the fall of 2015. He is a member of the IEEE.

    Xiaokang Yang received a B.S. from Xiamen University, Xiamen, China, in 1994, an M.S. from the Chinese Academy of Sciences, Shanghai, China, in 1997, and a Ph.D. from Shanghai Jiao Tong University, Shanghai, in 2000. He is currently a Distinguished Professor with the School of Electronic Information and Electrical Engineering and the Deputy Director of the Institute of Image Communication and Information Processing at Shanghai Jiao Tong University. He has authored over 200 refereed papers and holds 40 patents. His current research interests include visual signal processing and communication, media analysis and retrieval, and pattern recognition. He is an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and a Senior Associate Editor of the IEEE SIGNAL PROCESSING LETTERS.
