Pattern Recognition, Volume 87, March 2019, Pages 80-93
Multi attention module for visual tracking

https://doi.org/10.1016/j.patcog.2018.10.005

Highlights

  • The first tracker to leverage multi-level visual attention.

  • Utilizes an LSTM to encode the temporal information of the video.

  • Makes DNN-based tracking-by-detection more computationally feasible.

  • A promoting strategy that refines tracker output with object-detection results.

  • Yields superior performance on widely-adopted benchmark datasets.

Abstract

We propose a new visual tracking algorithm that leverages multi-level visual attention to make full use of the information available during tracking. Visual attention has been widely applied in many visual tasks, such as image captioning and question answering. However, most existing attention models focus on only one or two aspects and ignore other information that is useful for visual tracking. Here, we identify four main attentional aspects of the tracking task and propose a unified network that leverages multi-level visual attention, comprising layer-wise attention, temporal attention, spatial attention and channel-wise attention. Considering that deep features of different levels may be suitable for different scenarios, we train an attention network in an off-line stage to facilitate feature selection during online tracking. To better exploit the temporal consistency assumption of visual tracking, we implement the attention network with long short-term memory (LSTM) units, which capture historical context information to perform more reliable inference at the current time step. Unlike image classification, the tracking task involves more complicated background clutter. We therefore purify the features with spatial attention and channel-wise attention to effectively suppress background noise and highlight the target region. In addition, we enforce deep feature sharing across target candidates using Region of Interest (ROI) pooling, so that the features of all candidates are extracted in only one forward pass of the DNN. To further improve tracking accuracy, we propose a promoting strategy that combines trackers with the detection results of a generic object detector, reducing the risk of tracking drift. The proposed tracking algorithm compares favorably against state-of-the-art methods on three popular benchmark datasets, and extensive experimental evaluations demonstrate the effectiveness of the proposed techniques.

Introduction

Online single-target tracking [1], [2], [3] aims to infer the location of an arbitrary object in subsequent frames given its initial position in the first frame. It remains a very challenging task to design a robust tracking algorithm that delivers satisfactory performance in real-world scenarios. Recently, deep neural networks (DNNs) have demonstrated remarkable performance and become the de facto standard in many computer vision problems, e.g., image classification [4], [5], semantic segmentation, and object detection [6], [7]. Their application to visual tracking [8], [9] has also significantly pushed state-of-the-art performance.

Prior works [10], [11], [12] have shown that features in higher layers of DNNs encode semantic concepts of object categories and are robust to significant target appearance changes, while features in lower layers preserve more spatial details but are very sensitive to dramatic appearance changes. Accordingly, prior works [10], [11] integrate features from both higher and lower layers to exploit their complementary properties under different tracking conditions. However, such simple combinations lack an effective attention mechanism that directs the finite computational resources to the most useful layer, and there exist few principled methods for determining which layers provide the optimal representation for the current frame.
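
To make the layer-wise idea concrete, the following is a minimal PyTorch sketch (not the authors' MATLAB/Caffe implementation) of extracting features from one lower and one higher VGG16 layer in a single pass; the specific layers (conv4_3 and conv5_3) and the input size are illustrative assumptions.

    import torch
    from torchvision.models import vgg16

    backbone = vgg16().features.eval()   # ImageNet-pretrained weights assumed in the paper
    frame = torch.randn(1, 3, 224, 224)  # dummy frame; the tracker feeds in the whole video frame

    feats, x = {}, frame
    with torch.no_grad():
        for i, layer in enumerate(backbone):
            x = layer(x)
            if i == 22:    # conv4_3 + ReLU: finer spatial detail, weaker semantics
                feats["low"] = x
            elif i == 29:  # conv5_3 + ReLU: coarser but more semantic
                feats["high"] = x
                break

    print(feats["low"].shape, feats["high"].shape)  # e.g. [1, 512, 28, 28] and [1, 512, 14, 14]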

Another issue of existing DNN-based trackers is that they mostly ignore the temporal correlation in the visual tracking problem. It is well acknowledged that video data demonstrate a strong temporal coherence, where the appearance and motion information of the target can seldom suffer from significantly variations within consecutive frames. These temporal consistency may serve as essential priors to suppress spurious false positives and improve tracking accuracy. Unfortunately, most existing DNN-based tracking methods perform target localization almost independently for each frame and fail to explore the strong temporal correlations. Recurrent neural networks (RNNs), which are effective in handling sequential data with correlations, have found very limited applications in online visual tracking.
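
As a hedged illustration of how an LSTM can carry this temporal context, the sketch below (PyTorch, with placeholder dimensions rather than the paper's actual sizes) runs an LSTM cell over per-frame feature vectors so that the hidden state at each step summarizes the preceding frames.

    import torch
    import torch.nn as nn

    feat_dim, hidden_dim = 512, 256        # placeholder sizes
    lstm = nn.LSTMCell(feat_dim, hidden_dim)

    h = torch.zeros(1, hidden_dim)         # hidden state: a summary of past frames
    c = torch.zeros(1, hidden_dim)         # cell state
    for t in range(5):                     # loop over the frames of a clip
        frame_feat = torch.randn(1, feat_dim)   # stand-in for a pooled CNN feature of frame t
        h, c = lstm(frame_feat, (h, c))         # h now mixes the current frame with its history
    # h can then be decoded into attention weights for the current frame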

In addition, CNN features pre-trained on ImageNet are designed to distinguish generic object categories and treat all channels equally, which is not entirely appropriate for the tracking task. In general, different channels extract different semantic information from the image. Some channels are useful for determining the target location, while others act as noise and introduce redundancy, leading to tracking drift. Effectively selecting among them can highlight the target and suppress responses from the background. Another way to achieve this goal is to add spatial attention to the features, which can be derived from the inter-frame relationship: a spatial weight map can be obtained directly from the target positions in previous frames. However, most existing algorithms do not make full use of the relationship among frames to accurately estimate the region of interest.
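
The two ideas in this paragraph can be sketched as follows; the sigmoid channel weights and the Gaussian spatial prior centered on the previous target position are illustrative choices, not necessarily the exact formulation used in the paper.

    import torch

    feat = torch.randn(1, 512, 28, 28)        # CNN feature map of the current frame

    # (1) channel-wise attention: one weight per channel (random placeholders here;
    #     in practice produced by a small learned network)
    channel_w = torch.sigmoid(torch.randn(1, 512, 1, 1))
    feat = feat * channel_w                   # suppress noisy or redundant channels

    # (2) spatial attention: a Gaussian centred on the previous target position
    prev_cx, prev_cy, sigma = 14.0, 10.0, 4.0   # in feature-map coordinates
    ys, xs = torch.meshgrid(torch.arange(28.0), torch.arange(28.0), indexing="ij")
    spatial_w = torch.exp(-((xs - prev_cx) ** 2 + (ys - prev_cy) ** 2) / (2 * sigma ** 2))
    feat = feat * spatial_w                   # highlight the region around the target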

To address the above issues, we propose a unified DNN-based tracking framework with a multi-level visual attention mechanism, integrating temporal, spatial, channel-wise and layer-wise attention simultaneously in an end-to-end network. Instead of relying only on the current state, we formulate visual tracking as a sequential inference problem with temporal context and implement the attention network with long short-term memory [13] units, which take as input the current features to be selected as well as the hidden state from the previous time step, allowing us to better model temporal consistency. Based on the features produced by the deep network, the attention module creates an attention map, shown in Fig. 1, encoding both the attentive region and the channels of interest. Guided by this attention map, candidates can be generated far more selectively and efficiently. Moreover, the proposed network also helps to evaluate the robustness of features from different layers and to perform the layer switch.
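
A minimal sketch of how a single LSTM hidden state could be decoded into the channel, spatial and layer-wise attentions discussed above is given below; the module name, layer sizes and the sigmoid/softmax choices are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MultiAttentionHead(nn.Module):
        def __init__(self, hidden_dim=256, channels=512, map_size=14, n_layers=2):
            super().__init__()
            self.channel_fc = nn.Linear(hidden_dim, channels)             # which channels to keep
            self.spatial_fc = nn.Linear(hidden_dim, map_size * map_size)  # where to look
            self.layer_fc = nn.Linear(hidden_dim, n_layers)               # which layer to trust
            self.map_size = map_size

        def forward(self, h):                             # h: LSTM hidden state [B, hidden_dim]
            ch = torch.sigmoid(self.channel_fc(h))        # [B, C] channel weights
            sp = torch.sigmoid(self.spatial_fc(h)).view(-1, 1, self.map_size, self.map_size)
            ly = torch.softmax(self.layer_fc(h), dim=1)   # [B, n_layers] layer-wise switch
            return ch, sp, ly

    head = MultiAttentionHead()
    ch, sp, ly = head(torch.randn(1, 256))                # decode one hidden state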

With the selected deep features, we adopt the tracking-by-detection framework to determine the location of the target. Some existing DNN-based methods [9], [14] also adopt this framework, but numerous candidate image windows must be evaluated by forward-propagating each of them through the DNN. The overlap among candidate windows and the presence of unnecessary candidates cause redundant computation, which is also memory inefficient and hinders the application of DNNs to real-time tasks. In contrast, we exploit feature sharing: using region of interest (ROI) pooling, the features of all target candidates are extracted in only one forward pass of the DNN, which significantly reduces both computational and memory overhead. To further improve tracking accuracy, we develop a promoting strategy that leverages the detection results of state-of-the-art object detectors.
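
The feature-sharing step can be illustrated with torchvision's roi_align, used here as a stand-in for the ROI pooling described in the paper; the box coordinates and spatial scale are made-up examples.

    import torch
    from torchvision.ops import roi_align

    frame_feat = torch.randn(1, 512, 28, 28)          # one forward pass over the full frame
    # candidate boxes (x1, y1, x2, y2) in the coordinates of a 224x224 frame
    candidates = torch.tensor([[60., 60., 120., 120.],
                               [70., 65., 130., 125.],
                               [55., 58., 115., 118.]])
    # spatial_scale maps image coordinates onto the 28x28 feature map (28 / 224 = 0.125)
    cand_feats = roi_align(frame_feat, [candidates], output_size=(7, 7), spatial_scale=0.125)
    print(cand_feats.shape)   # [3, 512, 7, 7]: one feature per candidate, no extra forward passes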

In summary, the contributions of our paper are mainly threefold:

  • We are the first to leverage multi-level visual attention, which includes temporal, spatial, channel-wise and layer-wise attention, in a unified framework for visual tracking, effectively improving tracking performance.

  • We propose a novel guidance technique for candidate filtering with the attention map. Together with ROI pooling, this makes DNN-based tracking-by-detection more computationally feasible.

  • We develop a promoting strategy for trackers that leverages the detection results of state-of-the-art object detectors with minimal additional computational overhead. Our method yields superior performance on three widely-adopted benchmark datasets.
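
The exact promoting rule is described later in the paper and is not reproduced here; the sketch below only illustrates the general idea under a simple assumption: if a generic detector returns a confident box that overlaps the tracker's current estimate, the tracker snaps to it to reduce drift. The thresholds and the function name are hypothetical.

    import torch
    from torchvision.ops import box_iou

    def promote(track_box, det_boxes, det_scores, iou_thr=0.5, score_thr=0.8):
        """track_box: [4]; det_boxes: [N, 4]; det_scores: [N]; boxes as (x1, y1, x2, y2)."""
        if det_boxes.numel() == 0:
            return track_box                        # no detections: keep the tracker's estimate
        ious = box_iou(track_box.unsqueeze(0), det_boxes).squeeze(0)
        best = torch.argmax(ious)
        if ious[best] > iou_thr and det_scores[best] > score_thr:
            return det_boxes[best]                  # trust the detector's overlapping, confident box
        return track_box

    refined = promote(torch.tensor([50., 50., 100., 100.]),
                      torch.tensor([[52., 48., 104., 102.]]),
                      torch.tensor([0.9]))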

Section snippets

Related work

Visual tracking algorithms. Visual tracking, as one of the fundamental topics in computer vision, has been intensively studied over the last few decades. Recent works mainly focus on developing more effective target appearance models. On one hand, more sophisticated features have been designed to better characterize the target, e.g., Haar-like features [15], HOG features [16], color names [17], edge-based features [18], etc. On the other hand, a large number of online learning algorithms

Proposed algorithm

The proposed tracking algorithm consists of four components: a feature extraction network, a multi attention network combining temporal, spatial, channel-wise and layer-wise attention, a tracking module for target localization, and an incorporation module. Fig. 2 presents the pipeline of our method. Given a new frame containing the target, we feed the whole image into the VGG16 network. Following prior works [10], [12], we adopt feature maps in both higher and lower convolutional layers to
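
A hedged sketch of the localization step of such a tracking-by-detection pipeline is shown below: a small binary classifier scores each candidate's pooled feature and the highest-scoring candidate becomes the new target state. The classifier architecture and sizes are placeholders, not the paper's tracking module.

    import torch
    import torch.nn as nn

    scorer = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, 2),                          # target vs. background logits
    )

    cand_feats = torch.randn(3, 512, 7, 7)          # e.g. the ROI-pooled candidate features above
    with torch.no_grad():
        scores = torch.softmax(scorer(cand_feats), dim=1)[:, 1]   # P(target) per candidate
    best = torch.argmax(scores)                     # index of the winning candidate box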

Experiment

We evaluated our algorithm on three popular datasets for tracking performance evaluation: the Online Tracking Benchmark (OTB), the Visual Object Tracking 2016 benchmark (VOT2016), and Temple Color 128, comparing the performance of our algorithm with that of state-of-the-art trackers. The proposed tracker is implemented in MATLAB with the Caffe framework [38] and runs at 3 fps on a PC with a 3.4 GHz CPU and a TITAN GPU. The source code will be made publicly available.

Conclusion and future work

In this paper, we propose a novel visual tracking algorithm that leverages a multi attention mechanism. Layer-wise attention is utilized to cope with different scenes, while spatial and channel-wise attention reduce redundant information as well as background noise during the tracking process. Temporal attention effectively models the correlation between consecutive frames. All of these are integrated in a unified framework where an end-to-end attention network is

References (52)

  • P. Li et al., Deep visual tracking: review and experimental comparison, Pattern Recognit. (2018)
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • J. Kwon et al., Tracking by sampling trackers, ICCV (2011)
  • F. Yang et al., Robust superpixel tracking, IEEE Trans. Image Process. (2014)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, ICLR (2015)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, NIPS (2012)
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • J. Redmon et al., You only look once: unified, real-time object detection, CVPR (2016)
  • R. Tao et al., Siamese instance search for tracking, CVPR (2016)
  • N. Wang et al., Learning a deep compact image representation for visual tracking, NIPS (2013)
  • C. Ma et al., Hierarchical convolutional features for visual tracking, ICCV (2015)
  • Y. Qi et al., Hedged deep tracking, CVPR (2016)
  • L. Wang et al., Visual tracking with fully convolutional networks, ICCV (2015)
  • A. Graves, Long Short-Term Memory (2012)
  • H. Nam et al., Learning multi-domain convolutional neural networks for visual tracking, CVPR (2016)
  • B. Babenko et al., Visual tracking with online multiple instance learning, CVPR (2009)
  • J.F. Henriques et al., Exploiting the circulant structure of tracking-by-detection with kernels, ECCV (2012)
  • M. Danelljan et al., Adaptive color attributes for real-time visual tracking, CVPR (2014)
  • G. Zhu et al., Beyond local search: tracking objects everywhere with instance-specific proposals, CVPR (2016)
  • H. Grabner et al., Semi-supervised on-line boosting for robust tracking, ECCV (2008)
  • S. Hare et al., Struck: structured output tracking with kernels, ICCV (2011)
  • M. Danelljan et al., Learning spatially regularized correlation filters for visual tracking, ICCV (2015)
  • H. Li et al., Robust online visual tracking with a single convolutional neural network, ACCV (2014)
  • L. Wang et al., Sequentially training convolutional networks for visual tracking, CVPR (2016)
  • M. Danelljan et al., Beyond correlation filters: learning continuous convolution operators for visual tracking, ECCV (2016)
  • J. Deng et al., ImageNet: a large-scale hierarchical image database, CVPR (2009)