Multi attention module for visual tracking
Introduction
Online single target tracking [1], [2], [3] aims to infer the location of an arbitrary object in subsequent frames given its initial position in the first frame. It is still a very challenging task to design a robust tracking algorithm that delivers satisfactory performance in real-world scenarios. Recently, deep neural networks (DNNs) have demonstrated remarkable performance and become the de facto standard in many computer vision problems, e.g., image classification [4], [5], semantic segmentation, object detection [6], [7], etc. Their application to visual tracking [8], [9] has also significantly advanced the state of the art.
Prior works [10], [11], [12] have shown that features in higher layers of DNNs encode semantic concepts of object categories and are robust to significant target appearance changes, while features in lower layers preserve more spatial details but are very sensitive to dramatic appearance changes. Thus, prior works [10], [11] integrate features from both higher and lower layers to exploit their complementary advantages, since the two kinds of features behave differently under different tracking conditions. However, these simple combinations lack an effective attention mechanism to focus the finite computational resources on the most useful layer, and there exist few principled methods to determine which layers provide the optimal representation for the current frame.
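As a rough illustration of the idea behind layer-wise fusion (not the paper's learned network), the following sketch blends a shallow and a deep response map with soft weights obtained from per-layer reliability scores; the function name `combine_layers` and the toy maps are hypothetical.

```python
import numpy as np

def combine_layers(feat_low, feat_high, layer_scores):
    """Fuse a shallow and a deep response map with soft layer-wise weights.

    feat_low, feat_high: (H, W) response maps from two conv layers,
    already resized to a common resolution.
    layer_scores: length-2 array of unnormalized reliability scores.
    """
    w = np.exp(layer_scores - layer_scores.max())
    w /= w.sum()                        # softmax over the two layers
    return w[0] * feat_low + w[1] * feat_high

low = np.array([[0.0, 1.0], [0.0, 0.0]])    # sharp but noisy details
high = np.array([[0.5, 0.5], [0.5, 0.5]])   # smooth, semantic response
fused = combine_layers(low, high, np.array([0.0, 0.0]))
# with equal scores the fusion reduces to a simple average of the two maps
```

A learned version would replace the fixed `layer_scores` with outputs of the attention network, shifting weight to the semantic layer under appearance change and to the detailed layer for precise localization.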
Another issue with existing DNN-based trackers is that they mostly ignore the temporal correlation inherent in the visual tracking problem. It is well acknowledged that video data exhibit strong temporal coherence: the appearance and motion of the target seldom undergo significant variations within consecutive frames. This temporal consistency may serve as an essential prior to suppress spurious false positives and improve tracking accuracy. Unfortunately, most existing DNN-based tracking methods perform target localization almost independently for each frame and fail to exploit these strong temporal correlations. Recurrent neural networks (RNNs), which are effective at handling sequential data with correlations, have so far found very limited application in online visual tracking.
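To make the temporal-consistency prior concrete, a minimal sketch (assumed for illustration only; the paper realizes this with an LSTM-based attention network, not this hand-crafted rule) is to modulate the per-frame response map with a Gaussian motion prior centred on the previous target position, so that distant false positives are suppressed:

```python
import numpy as np

def temporal_prior(score_map, prev_pos, sigma=2.0):
    """Down-weight responses far from the previous target position.

    score_map: (H, W) per-frame detection scores.
    prev_pos:  (row, col) of the target in the previous frame.
    """
    h, w = score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - prev_pos[0]) ** 2 + (xs - prev_pos[1]) ** 2
    prior = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian motion prior
    return score_map * prior
```

With two equally strong peaks, the one near the previous position survives the prior, which is exactly the kind of spurious-false-positive suppression the paragraph describes.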
In addition, CNN features pre-trained on ImageNet are adopted for distinguishing generic objects and treat all channels equally, which is not well suited to the tracking task. Generally, different channels extract different semantic information from the image. Some of them are useful for determining the target location, while others may act as noise and cause information redundancy, leading to tracking drift. Effective selection among the channels can highlight the target while suppressing responses from the background. Another way to achieve this goal is to add spatial attention to the features, which is contained in the inter-frame relationship: a spatial weight can easily be derived from the target positions in previous frames. However, most algorithms do not make full use of the relationship among frames to compute an accurate position for the region of interest.
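The channel-selection idea can be sketched as follows; this is a minimal squeeze-style re-weighting under my own assumptions (global-average descriptors plus a softmax), not the learned channel attention of the paper, and `channel_attention` is a hypothetical name.

```python
import numpy as np

def channel_attention(feats):
    """Re-weight feature channels by a softmax over their global means.

    feats: (C, H, W) feature tensor; channels with stronger average
    activation receive larger weights, weak (noisy) channels are damped.
    """
    scores = feats.mean(axis=(1, 2))            # (C,) channel descriptors
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over channels
    return feats * w[:, None, None]
```

In a trained network these weights would instead be predicted from the target, so that channels responding to the target are amplified and background-dominated channels are suppressed.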
To address the above issues, we propose a unified DNN-based tracking framework using a multi-level visual attention mechanism, integrating temporal, spatial, channel-wise and layer-wise attention simultaneously in an end-to-end network. Instead of relying only on the current state, we formulate visual tracking as a sequential inference problem with temporal context and implement the attention network with long short-term memory (LSTM) [13] units, which take as input the current features to be selected as well as the hidden states from the previous time step, allowing us to better model temporal consistency. Based on the features produced by the deep network, we apply the attention module to create an attention map, shown in Fig. 1, encoding both the attentive region and the informative channels. Then, guided by the attention map, we can generate candidates far more selectively and efficiently. Moreover, the proposed network also helps to evaluate the robustness of features from different layers and to perform layer switching.
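One way attention-guided candidate generation could work is to draw candidate centres with probability proportional to the attention map, so that samples concentrate on attended regions; this sampler is an assumption for illustration and may differ from the paper's exact guidance scheme.

```python
import numpy as np

def sample_candidates(attn, n, rng):
    """Draw n candidate centres with probability proportional to attn.

    attn: (H, W) non-negative attention map.
    Returns an (n, 2) array of (row, col) positions.
    """
    p = attn.ravel() / attn.sum()               # normalize to a distribution
    idx = rng.choice(p.size, size=n, p=p)       # weighted sampling of cells
    return np.stack(np.unravel_index(idx, attn.shape), axis=1)

attn = np.zeros((8, 8))
attn[3, 4] = 1.0                                # all attention on one cell
pts = sample_candidates(attn, 5, np.random.default_rng(0))
# every sampled centre falls on the attended cell (3, 4)
```

Compared with dense or uniform sampling, this concentrates the candidate budget where the attention map expects the target, which is the efficiency gain the paragraph alludes to.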
With the selected deep features, we adopt the tracking-by-detection framework to determine the location of the target. Some existing DNN-based methods [9], [14] also adopt the tracking-by-detection framework, where numerous candidate image windows have to be evaluated by forward-propagating them through the DNN. The overlap among candidate windows and the unnecessary candidates cause redundant computation, which is also memory inefficient and hinders the application of DNNs to real-time tasks. In contrast to these methods, we take advantage of feature sharing by using region-of-interest (ROI) pooling to extract features for all target candidates through only one forward pass of the DNN, which significantly reduces both computational and memory overhead. To further improve tracking accuracy, we develop a promoting strategy that leverages the detection results of state-of-the-art object detectors.
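The feature-sharing mechanism rests on ROI pooling: each candidate window is cut from the shared feature map and max-pooled into a fixed-size grid, so the backbone runs once per frame. Below is a minimal single-channel sketch (standard ROI max-pooling, not the paper's specific layer configuration).

```python
import numpy as np

def roi_pool(feat, roi, out_size=2):
    """Max-pool one ROI from a shared feature map into a fixed grid.

    feat: (H, W) single-channel feature map.
    roi:  (y0, x0, y1, x1), start inclusive, end exclusive.
    Returns an (out_size, out_size) pooled feature.
    """
    y0, x0, y1, x1 = roi
    patch = feat[y0:y1, x0:x1]
    h, w = patch.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)   # bin boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = patch[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because every candidate only indexes into `feat`, evaluating hundreds of windows costs one backbone pass plus cheap pooling, instead of one pass per window.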
In summary, the contributions of our paper are mainly threefold:
- We are the first to leverage multi-level visual attention, including temporal, spatial, channel-wise and layer-wise attention, in a unified framework for visual tracking, effectively improving tracking performance.
- We propose a novel guidance technique for candidate filtering with the attention map. Together with ROI pooling, this makes DNN-based tracking-by-detection more computationally feasible.
- We develop a promoting strategy for trackers that leverages the detection results of state-of-the-art object detectors with minimal additional computational overhead. Our method yields superior performance on three widely adopted benchmark datasets.
Related work
Visual tracking algorithms. Visual tracking, as one of the fundamental topics in computer vision, has been intensively studied over the past decades. Recent works mainly focus on developing more effective target appearance models. On the one hand, more sophisticated features have been designed to better characterize the target, e.g., Haar-like features [15], HOG features [16], color names [17], edge-based features [18], etc. On the other hand, a large number of online learning algorithms
Proposed algorithm
The proposed tracking algorithm consists of four components: a feature extraction network; a multi-attention network combining temporal, spatial, channel-wise and layer-wise attention; a tracking module for target localization; and an incorporation module. Fig. 2 presents the pipeline of our method. Given a new frame containing the target, we feed the whole image into the VGG16 network. Following prior works [10], [12], we adopt feature maps in both higher and lower convolutional layers to
Experiment
We evaluate our algorithm on three popular tracking benchmarks: the Online Tracking Benchmark (OTB), the Visual Object Tracking 2016 benchmark (VOT2016) and Temple Color 128, and compare its performance against state-of-the-art trackers. The proposed tracker is implemented in MATLAB with the Caffe framework [38] and runs at 3 fps on a PC with a 3.4 GHz CPU and a TITAN GPU. The source code will be made publicly available.
Conclusion and future work
In this paper, we propose a novel visual tracking algorithm leveraging a multi-attention mechanism. The layer-wise attention is utilized to cope with different scenes, while the spatial and channel-wise attention reduce redundant information as well as background noise during tracking. The temporal attention effectively models the correlation between consecutive frames. All of these are integrated in a unified framework where an end-to-end attention network is
References (52)
- Deep visual tracking: review and experimental comparison, Pattern Recognit. (2018)
- High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- Tracking by sampling trackers, ICCV (2011)
- Robust superpixel tracking, IEEE Trans. Image Process. (2014)
- Very deep convolutional networks for large-scale image recognition, ICLR (2015)
- ImageNet classification with deep convolutional neural networks, NIPS (2012)
- Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
- You Only Look Once: unified, real-time object detection, CVPR (2016)
- Siamese instance search for tracking, CVPR (2016)
- Learning a deep compact image representation for visual tracking, NIPS (2013)
- Hierarchical convolutional features for visual tracking, ICCV
- Hedged deep tracking, CVPR
- Visual tracking with fully convolutional networks, ICCV
- Long short-term memory
- Learning multi-domain convolutional neural networks for visual tracking, CVPR
- Visual tracking with online multiple instance learning, CVPR
- Exploiting the circulant structure of tracking-by-detection with kernels, ECCV
- Adaptive color attributes for real-time visual tracking, CVPR
- Beyond local search: tracking objects everywhere with instance-specific proposals, CVPR
- Semi-supervised on-line boosting for robust tracking, ECCV
- Struck: structured output tracking with kernels, ICCV
- Learning spatially regularized correlation filters for visual tracking, ICCV
- Robust online visual tracking with a single convolutional neural network, ACCV
- Sequentially training convolutional networks for visual tracking, CVPR
- Beyond correlation filters: learning continuous convolution operators for visual tracking, ECCV
- ImageNet: a large-scale hierarchical image database, CVPR