Pattern Recognition, Volume 87, March 2019, Pages 80-93
Multi attention module for visual tracking

https://doi.org/10.1016/j.patcog.2018.10.005

Highlights

  • The first tracker to leverage multi-level visual attention.

  • Utilizes an LSTM to encode the temporal information of the video.

  • Makes DNN-based tracking-by-detection more computationally feasible.

  • A promoting strategy that refines tracker output with object-detection results.

  • Yields superior performance on widely-adopted benchmark datasets.

Abstract

We propose a new visual tracking algorithm that leverages multi-level visual attention to make full use of the information available during tracking. Visual attention has been widely applied in many visual tasks, such as image captioning and question answering. However, most existing attention models focus on only one or two aspects and ignore other information that is useful for visual tracking. Here, we identify four main attentional aspects of the tracking task and propose a unified network that leverages multi-level visual attention, comprising layer-wise attention, temporal attention, spatial attention and channel-wise attention. Considering that deep features of different levels may be suitable for different scenarios, we train an attention network in an off-line stage to facilitate feature selection during online tracking. To better exploit the temporal consistency assumption of visual tracking, we implement the attention network with long short-term memory (LSTM) units, which capture historical context information to perform more reliable inference at the current time step. Unlike image classification, the tracking task involves more complicated background clutter. We therefore purify the features with spatial attention and channel-wise attention to effectively suppress background noise and highlight the target region. In addition, we enforce deep feature sharing across target candidates using Region of Interest (ROI) pooling, so that the features of all candidates are extracted in only one forward pass of the DNN. To further improve tracking accuracy, we propose a promoting strategy that combines trackers with the detection results of a generic object detector, reducing the risk of tracking drift. The proposed tracking algorithm compares favorably against state-of-the-art methods on three popular benchmark datasets, and extensive experimental evaluations demonstrate the effectiveness of the proposed techniques.

Introduction

Online single-target tracking [1], [2], [3] aims to infer the location of an arbitrary object in subsequent frames given its initial position in the first frame. It remains a very challenging task to design a robust tracking algorithm that delivers satisfactory performance in real-world scenarios. Recently, deep neural networks (DNNs) have demonstrated remarkable performance and become the de facto standard in many computer vision problems, e.g., image classification [4], [5], semantic segmentation, and object detection [6], [7]. Their application to visual tracking [8], [9] has also significantly pushed state-of-the-art performance.

Prior works [10], [11], [12] have shown that features in higher layers of DNNs encode semantic concepts of object categories and are robust to significant target appearance changes, while features in lower layers preserve more spatial details but are very sensitive to dramatic appearance changes. Accordingly, prior works [10], [11] integrate features from both higher and lower layers to exploit their complementary properties under different tracking conditions. However, such simple combinations lack an effective attention mechanism that directs the finite computational resources to the most useful layer, and there exist few principled methods for determining which layers provide the optimal representation for the current frame.
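
To make the layer-wise idea concrete, the following is a minimal PyTorch sketch (not the authors' MATLAB/Caffe implementation) of extracting features from one lower and one higher VGG16 layer in a single pass; the specific layers (conv4_3 and conv5_3) and the input size are illustrative assumptions.

    import torch
    from torchvision.models import vgg16

    backbone = vgg16().features.eval()   # ImageNet-pretrained weights assumed in the paper
    frame = torch.randn(1, 3, 224, 224)  # dummy frame; the tracker feeds in the whole video frame

    feats, x = {}, frame
    with torch.no_grad():
        for i, layer in enumerate(backbone):
            x = layer(x)
            if i == 22:    # conv4_3 + ReLU: finer spatial detail, weaker semantics
                feats["low"] = x
            elif i == 29:  # conv5_3 + ReLU: coarser but more semantic
                feats["high"] = x
                break

    print(feats["low"].shape, feats["high"].shape)  # e.g. [1, 512, 28, 28] and [1, 512, 14, 14]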

Another issue of existing DNN-based trackers is that they mostly ignore the temporal correlation in the visual tracking problem. It is well acknowledged that video data demonstrate a strong temporal coherence, where the appearance and motion information of the target can seldom suffer from significantly variations within consecutive frames. These temporal consistency may serve as essential priors to suppress spurious false positives and improve tracking accuracy. Unfortunately, most existing DNN-based tracking methods perform target localization almost independently for each frame and fail to explore the strong temporal correlations. Recurrent neural networks (RNNs), which are effective in handling sequential data with correlations, have found very limited applications in online visual tracking.
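
As a hedged illustration of how an LSTM can carry this temporal context, the sketch below (PyTorch, with placeholder dimensions rather than the paper's actual sizes) runs an LSTM cell over per-frame feature vectors so that the hidden state at each step summarizes the preceding frames.

    import torch
    import torch.nn as nn

    feat_dim, hidden_dim = 512, 256        # placeholder sizes
    lstm = nn.LSTMCell(feat_dim, hidden_dim)

    h = torch.zeros(1, hidden_dim)         # hidden state: a summary of past frames
    c = torch.zeros(1, hidden_dim)         # cell state
    for t in range(5):                     # loop over the frames of a clip
        frame_feat = torch.randn(1, feat_dim)   # stand-in for a pooled CNN feature of frame t
        h, c = lstm(frame_feat, (h, c))         # h now mixes the current frame with its history
    # h can then be decoded into attention weights for the current frame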

In addition, CNN features pre-trained on ImageNet are designed to distinguish generic object categories and treat all channels equally, which is not entirely appropriate for the tracking task. In general, different channels extract different semantic information from the image. Some channels are useful for determining the target location, while others act as noise and introduce redundancy, leading to tracking drift. Effectively selecting among them can highlight the target and suppress responses from the background. Another way to achieve this goal is to add spatial attention to the features, which can be derived from the inter-frame relationship: a spatial weight map can be obtained directly from the target positions in previous frames. However, most existing algorithms do not make full use of the relationship among frames to accurately estimate the region of interest.
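
The two ideas in this paragraph can be sketched as follows; the sigmoid channel weights and the Gaussian spatial prior centered on the previous target position are illustrative choices, not necessarily the exact formulation used in the paper.

    import torch

    feat = torch.randn(1, 512, 28, 28)        # CNN feature map of the current frame

    # (1) channel-wise attention: one weight per channel (random placeholders here;
    #     in practice produced by a small learned network)
    channel_w = torch.sigmoid(torch.randn(1, 512, 1, 1))
    feat = feat * channel_w                   # suppress noisy or redundant channels

    # (2) spatial attention: a Gaussian centred on the previous target position
    prev_cx, prev_cy, sigma = 14.0, 10.0, 4.0   # in feature-map coordinates
    ys, xs = torch.meshgrid(torch.arange(28.0), torch.arange(28.0), indexing="ij")
    spatial_w = torch.exp(-((xs - prev_cx) ** 2 + (ys - prev_cy) ** 2) / (2 * sigma ** 2))
    feat = feat * spatial_w                   # highlight the region around the target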

To address the above issues, we propose a unified DNN-based tracking framework with a multi-level visual attention mechanism, integrating temporal, spatial, channel-wise and layer-wise attention simultaneously in an end-to-end network. Instead of relying only on the current state, we formulate visual tracking as a sequential inference problem with temporal context and implement the attention network with long short-term memory [13] units, which take as input the current features to be selected as well as the hidden state from the previous time step, allowing us to better model temporal consistency. Based on the features produced by the deep network, the attention module creates an attention map, shown in Fig. 1, encoding both the attentive region and the channels of interest. Guided by this attention map, candidates can be generated far more selectively and efficiently. Moreover, the proposed network also helps to evaluate the robustness of features from different layers and to perform the layer switch.
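
A minimal sketch of how a single LSTM hidden state could be decoded into the channel, spatial and layer-wise attentions discussed above is given below; the module name, layer sizes and the sigmoid/softmax choices are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MultiAttentionHead(nn.Module):
        def __init__(self, hidden_dim=256, channels=512, map_size=14, n_layers=2):
            super().__init__()
            self.channel_fc = nn.Linear(hidden_dim, channels)             # which channels to keep
            self.spatial_fc = nn.Linear(hidden_dim, map_size * map_size)  # where to look
            self.layer_fc = nn.Linear(hidden_dim, n_layers)               # which layer to trust
            self.map_size = map_size

        def forward(self, h):                             # h: LSTM hidden state [B, hidden_dim]
            ch = torch.sigmoid(self.channel_fc(h))        # [B, C] channel weights
            sp = torch.sigmoid(self.spatial_fc(h)).view(-1, 1, self.map_size, self.map_size)
            ly = torch.softmax(self.layer_fc(h), dim=1)   # [B, n_layers] layer-wise switch
            return ch, sp, ly

    head = MultiAttentionHead()
    ch, sp, ly = head(torch.randn(1, 256))                # decode one hidden state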

With the selected deep features, we adopt the tracking-by-detection framework to determine the location of the target. Some existing DNN-based methods [9], [14] also adopt this framework, but numerous candidate image windows must be evaluated by forward-propagating each of them through the DNN. The overlap among candidate windows and the presence of unnecessary candidates cause redundant computation, which is also memory inefficient and hinders the application of DNNs to real-time tasks. In contrast, we exploit feature sharing: using region of interest (ROI) pooling, the features of all target candidates are extracted in only one forward pass of the DNN, which significantly reduces both computational and memory overhead. To further improve tracking accuracy, we develop a promoting strategy that leverages the detection results of state-of-the-art object detectors.
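
The feature-sharing step can be illustrated with torchvision's roi_align, used here as a stand-in for the ROI pooling described in the paper; the box coordinates and spatial scale are made-up examples.

    import torch
    from torchvision.ops import roi_align

    frame_feat = torch.randn(1, 512, 28, 28)          # one forward pass over the full frame
    # candidate boxes (x1, y1, x2, y2) in the coordinates of a 224x224 frame
    candidates = torch.tensor([[60., 60., 120., 120.],
                               [70., 65., 130., 125.],
                               [55., 58., 115., 118.]])
    # spatial_scale maps image coordinates onto the 28x28 feature map (28 / 224 = 0.125)
    cand_feats = roi_align(frame_feat, [candidates], output_size=(7, 7), spatial_scale=0.125)
    print(cand_feats.shape)   # [3, 512, 7, 7]: one feature per candidate, no extra forward passes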

In summary, the contributions of our paper are mainly threefold:

  • We are the first to leverage multi-level visual attention, which includes temporal, spatial, channel-wise and layer-wise attention, in a unified framework for visual tracking, effectively improving tracking performance.

  • We propose a novel guidance technique for candidate filtering with the attention map. Together with ROI pooling, this makes DNN-based tracking-by-detection more computationally feasible.

  • We develop a promoting strategy for trackers that leverages the detection results of state-of-the-art object detectors with minimal additional computational overhead. Our method yields superior performance on three widely-adopted benchmark datasets.
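
The exact promoting rule is described later in the paper and is not reproduced here; the sketch below only illustrates the general idea under a simple assumption: if a generic detector returns a confident box that overlaps the tracker's current estimate, the tracker snaps to it to reduce drift. The thresholds and the function name are hypothetical.

    import torch
    from torchvision.ops import box_iou

    def promote(track_box, det_boxes, det_scores, iou_thr=0.5, score_thr=0.8):
        """track_box: [4]; det_boxes: [N, 4]; det_scores: [N]; boxes as (x1, y1, x2, y2)."""
        if det_boxes.numel() == 0:
            return track_box                        # no detections: keep the tracker's estimate
        ious = box_iou(track_box.unsqueeze(0), det_boxes).squeeze(0)
        best = torch.argmax(ious)
        if ious[best] > iou_thr and det_scores[best] > score_thr:
            return det_boxes[best]                  # trust the detector's overlapping, confident box
        return track_box

    refined = promote(torch.tensor([50., 50., 100., 100.]),
                      torch.tensor([[52., 48., 104., 102.]]),
                      torch.tensor([0.9]))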

Section snippets

Related work

Visual tracking algorithms. Visual tracking, as one of the fundamental topics in computer vision, has been intensively studied over the last few decades. Recent works mainly focus on developing more effective target appearance models. On one hand, more sophisticated features have been designed to better characterize the target, e.g., Haar-like features [15], HOG features [16], color names [17], edge-based features [18], etc. On the other hand, a large number of online learning algorithms

Proposed algorithm

The proposed tracking algorithm consists of four components: a feature extraction network, a multi attention network combining temporal, spatial, channel-wise and layer-wise attention, a tracking module for target localization, and an incorporation module. Fig. 2 presents the pipeline of our method. Given a new frame containing the target, we feed the whole image into the VGG16 network. Following prior works [10], [12], we adopt feature maps in both higher and lower convolutional layers to
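
A hedged sketch of the localization step of such a tracking-by-detection pipeline is shown below: a small binary classifier scores each candidate's pooled feature and the highest-scoring candidate becomes the new target state. The classifier architecture and sizes are placeholders, not the paper's tracking module.

    import torch
    import torch.nn as nn

    scorer = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, 2),                          # target vs. background logits
    )

    cand_feats = torch.randn(3, 512, 7, 7)          # e.g. the ROI-pooled candidate features above
    with torch.no_grad():
        scores = torch.softmax(scorer(cand_feats), dim=1)[:, 1]   # P(target) per candidate
    best = torch.argmax(scores)                     # index of the winning candidate box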

Experiment

We evaluated our algorithm on three popular datasets for tracking performance evaluation: the Online Tracking Benchmark (OTB), the Visual Object Tracking 2016 benchmark (VOT2016), and Temple Color 128, comparing the performance of our algorithm with that of state-of-the-art trackers. The proposed tracker is implemented in MATLAB with the Caffe framework [38] and runs at 3 fps on a PC with a 3.4 GHz CPU and a TITAN GPU. The source code will be made publicly available.

Conclusion and future work

In this paper, we propose a novel visual tracking algorithm that leverages a multi attention mechanism. Layer-wise attention is utilized to cope with different scenes, while spatial and channel-wise attention reduce redundant information as well as background noise during the tracking process. Temporal attention effectively models the correlation between consecutive frames. All of these are integrated in a unified framework where an end-to-end attention network is

References (52)

  • P. Li et al., Deep visual tracking: review and experimental comparison, Pattern Recognit. (2018)
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • J. Kwon et al., Tracking by sampling trackers, ICCV (2011)
  • F. Yang et al., Robust superpixel tracking, IEEE Trans. Image Process. (2014)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, ICLR (2015)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, NIPS (2012)
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • J. Redmon et al., You only look once: unified, real-time object detection, CVPR (2016)
  • R. Tao et al., Siamese instance search for tracking, CVPR (2016)
  • N. Wang et al., Learning a deep compact image representation for visual tracking, NIPS (2013)
  • C. Ma et al., Hierarchical convolutional features for visual tracking, ICCV (2015)
  • Y. Qi et al., Hedged deep tracking, CVPR (2016)
  • L. Wang et al., Visual tracking with fully convolutional networks, ICCV (2015)
  • A. Graves, Long Short-Term Memory (2012)
  • H. Nam et al., Learning multi-domain convolutional neural networks for visual tracking, CVPR (2016)
  • B. Babenko et al., Visual tracking with online multiple instance learning, CVPR (2009)
  • J.F. Henriques et al., Exploiting the circulant structure of tracking-by-detection with kernels, ECCV (2012)
  • M. Danelljan et al., Adaptive color attributes for real-time visual tracking, CVPR (2014)
  • G. Zhu et al., Beyond local search: tracking objects everywhere with instance-specific proposals, CVPR (2016)
  • H. Grabner et al., Semi-supervised on-line boosting for robust tracking, ECCV (2008)
  • S. Hare et al., Struck: structured output tracking with kernels, ICCV (2011)
  • M. Danelljan et al., Learning spatially regularized correlation filters for visual tracking, ICCV (2015)
  • H. Li et al., Robust online visual tracking with a single convolutional neural network, ACCV (2014)
  • L. Wang et al., Sequentially training convolutional networks for visual tracking, CVPR (2016)
  • M. Danelljan et al., Beyond correlation filters: learning continuous convolution operators for visual tracking, ECCV (2016)
  • J. Deng et al., ImageNet: a large-scale hierarchical image database, CVPR (2009)