CSART: Channel and spatial attention-guided residual learning for real-time object tracking
Introduction
Visual object tracking is one of the most challenging tasks in computer vision. Given a target annotated by a bounding box in the first frame, the task is to locate that object in all subsequent frames. Tracking has a wide range of vision applications, such as video surveillance [1], motion analysis [2] and human-computer interaction [3]. Although recent deep tracking methods [4], [5], [6], [7] have made substantial progress, challenging factors remain, including background clutter, motion blur, occlusion, fast motion, deformation and scale variation. In particular, practical applications require robust trackers with both good tracking accuracy and real-time speed.
In recent years, trackers based on Siamese networks [8], [9], [6], [10], [11] have attracted considerable attention in the visual tracking community because they balance performance and speed well. In particular, SiamFC [8] casts tracking as a template matching problem in a learned embedding space and builds a fully convolutional Siamese network to learn the similarity. As a follow-up work, SA-Siam [9] employs a twofold Siamese network consisting of an appearance branch and a semantic branch to exploit heterogeneous features and improve on SiamFC. SiamRPN [6] embeds a region proposal network into the Siamese network, formulating the task as a one-shot local detection problem. In the same period, RASNet [10] introduces three types of attention mechanisms over the feature maps of the template branch to weight the cross-correlation operation, thereby decoupling feature representation from similarity learning to relieve over-fitting. SiamDW [11] and SiamRPN++ [12] eliminate the impact of the padding operation in deeper networks via random translation sampling and an added crop operation, respectively. As a result, popular and powerful networks such as ResNet [13] have been successfully adopted as backbones of Siamese networks, which greatly improves performance.
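The SiamFC-style matching described above can be illustrated with a minimal NumPy sketch: the template feature map is slid over the search-region feature map, and the inner product at each offset yields a response map whose peak indicates the target location. The shapes and the explicit loops here are illustrative only; real trackers implement this as a batched convolution on learned deep features.

```python
import numpy as np

def cross_correlation(template, search):
    """Slide the template feature map over the search feature map and
    compute the inner-product similarity at every offset (SiamFC-style)."""
    c, th, tw = template.shape
    _, sh, sw = search.shape
    out_h, out_w = sh - th + 1, sw - tw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = search[:, i:i + th, j:j + tw]
            response[i, j] = np.sum(window * template)
    return response

# Toy features: 4 channels, 6x6 template, 22x22 search region (hypothetical sizes).
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 6, 6))    # template branch feature
x = rng.standard_normal((4, 22, 22))  # search branch feature
score_map = cross_correlation(z, x)
print(score_map.shape)  # (17, 17) response map; the argmax gives the target location
```

When the search region contains an exact copy of the template, the response at that offset equals the squared norm of the template feature, which is why the peak of the map localizes the target.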
Despite these significant successes, most Siamese trackers can only discriminate the target from non-semantic background, because they use the features of the last convolutional layer of the template branch to represent the target uniformly. Performance therefore cannot be guaranteed in challenging scenes involving deformation, background clutter and rotation. Since the networks learn generic features, the learned features usually cannot adapt well to an arbitrary object during tracking. Some trackers [5], [14] update the deep features online, which leads to expensive computation and is prone to over-fitting. Moreover, most existing algorithms attend to the local semantic features of the object but ignore its global contextual information, which limits their capability of capturing robust features. Motivated by these considerations, we develop an effective and efficient Siamese network with several kinds of residual self-attention modules for high-performance visual tracking. Our intention is to improve feature representation via the self-attention mechanism: focusing on meaningful, inter-dependent features and suppressing redundant, unimportant ones for object tracking.
In this paper, we design an end-to-end architecture named CSART to learn a powerful feature representation and adaptive residual attention weights for visual object tracking. For the template branch, the self-attention module selectively emphasizes interdependent channel-wise features via channel attention and captures strong contextual information with spatial attention. For the search branch, the criss-cross attention learns global spatial inter-dependencies, while the cross-attention activates the corresponding channels. These attention features are then integrated with the base feature via residual learning for adaptive feature enhancement, which is beneficial for adapting the offline-trained feature representation to an arbitrary tracking target. Moreover, we introduce a novel joint loss function to emphasize hard negative samples and guide the learning process of our model. To guarantee real-time tracking, all these learning processes are carried out during the training phase. Since the self-attention is only computed in the first frame and the criss-cross attention is lightweight, the extra parameters and computation are negligible in most cases.
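The residual integration of attention-weighted features with the base feature can be sketched as follows. The sigmoid gates below are simplified stand-ins for the learned channel and spatial attention modules in the paper; only the residual-fusion pattern itself is taken from the text.

```python
import numpy as np

def channel_attention(feat):
    """Weight each channel by a gate derived from global average pooling
    (a squeeze-and-excite style simplification of a learned channel module)."""
    pooled = feat.mean(axis=(1, 2))            # (C,) channel descriptor
    gate = 1.0 / (1.0 + np.exp(-pooled))       # sigmoid gate per channel
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Weight each spatial position by a gate derived from pooling across channels."""
    pooled = feat.mean(axis=0)                 # (H, W) spatial descriptor
    gate = 1.0 / (1.0 + np.exp(-pooled))
    return feat * gate[None, :, :]

def residual_attention(feat):
    """Residual fusion: the base feature plus its attention-refined versions,
    so the enhancement adapts rather than replaces the offline-trained feature."""
    return feat + channel_attention(feat) + spatial_attention(feat)

rng = np.random.default_rng(1)
f = rng.standard_normal((8, 6, 6))             # toy (C, H, W) feature map
out = residual_attention(f)
print(out.shape)  # same shape as the input feature map
```

Because the attention terms are added to, rather than substituted for, the base feature, the module can only reweight the representation, which is what makes the enhancement safe to apply to an arbitrary tracking target.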
In fact, SiamFC [8] serves as our baseline. We embed three types of self-attention modules into the Siamese network to train the architecture, and apply the same online tracking mechanism. Fig. 1 shows that CSART achieves a more powerful representation and better tracking accuracy. Extensive experiments and analysis on several benchmarks verify the effectiveness of our method.
To summarize, our main contributions are four-fold.
- An end-to-end deep architecture especially designed for the offline training of Siamese trackers is proposed, which inherits the merits of deep networks and the self-attention mechanism to improve the capacity of feature representation for high-performance tracking without any fine-tuning.
- We integrate the base feature with the features weighted by the context-aware self-attention over both spatial and channel dimensions via residual learning, enhancing the discriminative representation of both branches and alleviating over-fitting to some extent.
- To address the imbalance between easy and hard samples, we design a joint loss function combining the classic logistic loss and a focal softmax loss to stress hard instances and promote effective training of our network.
- Comprehensive experiments on several representative tracking benchmarks demonstrate that the proposed CSART achieves state-of-the-art performance at a speed far beyond real-time.
Section snippets
Related work
Visual representation is important for object tracking. Existing deep trackers [5], [16], [17], [18] mainly exploit pre-trained networks [19], [13] and have achieved great success in the visual tracking community. In this section, we give a brief review of Siamese network based trackers and attention mechanisms in computer vision, and state the main differences between our method and others.
The proposed tracking framework
In this section, we first introduce an effective and efficient tracking framework of attention-guided Siamese network, which can be trained in an end-to-end manner. Fig. 2 illustrates the pipeline of the proposed framework. Then three self-attention modules including spatial residual attention, channel residual attention and criss-cross attention are introduced to capture contextual information in spatial and channel dimension. Different from previous tracking architecture, CSART reformulates
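The criss-cross attention mentioned above can be sketched with NumPy. Following the CCNet-style idea the snippet names, each position attends only over its own row and column (H + W - 1 positions) rather than all H x W positions, which is what keeps the global-context aggregation lightweight; the unprojected queries and keys here are a simplification of the learned module.

```python
import numpy as np

def criss_cross_attention(feat):
    """For each position, aggregate features from its row and column via a
    softmax over inner-product affinities, then add a residual connection."""
    c, h, w = feat.shape
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            q = feat[:, i, j]                                  # query vector
            row = [feat[:, i, jj] for jj in range(w)]
            col = [feat[:, ii, j] for ii in range(h) if ii != i]
            keys = np.stack(row + col, axis=1)                 # (C, H + W - 1)
            energy = q @ keys                                  # affinity scores
            attn = np.exp(energy - energy.max())
            attn /= attn.sum()                                 # softmax weights
            out[:, i, j] = keys @ attn                         # weighted aggregation
    return feat + out  # residual connection keeps the base feature

rng = np.random.default_rng(2)
f = rng.standard_normal((4, 5, 6))   # toy (C, H, W) search-branch feature
out = criss_cross_attention(f)
print(out.shape)  # matches the input shape
```

Stacking two such passes propagates information from every position to every other, giving full-image context at a fraction of the cost of dense self-attention.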
Experiments
In this section, we first introduce the implementation details of network and parameters. Then, our tracker is evaluated on several popular tracking benchmarks including OTB-2013 [44], OTB-2015 [15], VOT-2016 [45], VOT-2017 [46], UAV123 [47] and LaSOT [48] and compared with other Siamese trackers and the state-of-the-art tracking algorithms. In addition, we also conduct ablation study by experiments to analyze the effectiveness of each component. The proposed method is implemented in Python
Conclusion and future work
In this work, we propose a self-attention guided deep Siamese network for visual tracking, which is especially designed for fast online tracking. We incorporate diverse self-attention features into the Siamese network through effective residual learning. The channel residual self-attention selectively emphasizes associated channels and reflects the channel-wise quality of features, while the spatial residual self-attention adaptively aggregates the information at every pixel by capturing the weighted sum
CRediT authorship contribution statement
Dawei Zhang: Conceptualization, Methodology, Visualization, Writing - original draft. Zhonglong Zheng: Validation, Writing - review & editing. Minglu Li: Writing - review & editing. Rixian Liu: Data curation, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants No. 61672467 and No. 61827810, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LGG18F020017.
Dawei Zhang received the B.E. degree from Huaiyin Institute of Technology, China, in 2017. He is working toward the Ph.D. degree in the College of Mathematics and Computer Science of Zhejiang Normal University, China. His research interests cover deep learning, reinforcement learning and computer vision.
References (66)
- et al., Multiple people tracking with articulation detection and stitching strategy, Neurocomputing (2020).
- et al., Human detection and tracking for video surveillance: a cognitive science approach, in…
- et al., Learning correspondence from the cycle-consistency of time, in…
- et al., Dynamic hand gesture recognition using vision-based approach for human-computer interaction, Neural Comput. Appl. (2018).
- M. Danelljan, G. Bhat, F. Shahbaz Khan, M. Felsberg, Efficient convolution operators for tracking, in: The IEEE…
- et al., Learning multi-domain convolutional neural networks for visual tracking, in…
- B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in: The…
- M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, Atom: accurate tracking by overlap maximization, in: The IEEE Conference…
- L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional siamese networks for object…
- et al., A twofold siamese network for real-time object tracking, in…
- Deeper and wider siamese networks for real-time visual tracking, in…
- Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell.
- Fast online tracking with detection refinement, IEEE Trans. Intell. Transp. Syst.
- Robust object tracking using manifold regularized convolutional neural networks, IEEE Trans. Multimedia.
- Quadruplet network with one-shot learning for fast visual object tracking, IEEE Trans. Image Process.
- Gradnet: Gradient-guided network for visual object tracking, in…
- Reinforced similarity learning: Siamese relation networks for robust object tracking, in…
- Joint representation learning with deep quadruplet network for real-time visual tracking.
- High performance visual tracking with siamese actor-critic network.
- Residual attention network for image classification, in…
Cited by (29)
- Attention-enhanced multi-source cost volume multi-view stereo, Engineering Applications of Artificial Intelligence, 2024.
- NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning, Engineering Applications of Artificial Intelligence, 2024.
- Perturbation-augmented Graph Convolutional Networks: A Graph Contrastive Learning architecture for effective node classification tasks, Engineering Applications of Artificial Intelligence, 2024.
- Improved SiamCAR with ranking-based pruning and optimization for efficient UAV tracking, Image and Vision Computing, 2024.
- An efficient multi-scale learning method for image super-resolution networks, Neural Networks, 2024.
Zhonglong Zheng received the B.E. degree from the University of Petroleum, China, in 1999, and the Ph.D. degree from Shanghai Jiao Tong University, China, in 2005. He is currently a full professor in the College of Mathematics and Computer Science of Zhejiang Normal University, China. His research interests include machine learning, computer vision and blockchain. He is the corresponding author of this article.
Minglu Li graduated from the School of Electronic Technology at The PLA Information Engineering University in 1985, and received the Ph.D. degree in computer software from Shanghai Jiao Tong University in 1996. He was a tenured full professor and the director of the Grid Computing Center of Shanghai Jiao Tong University. Currently, he is a full professor in the College of Mathematics and Computer Science of Zhejiang Normal University, China. His research interests include grid computing, services computing, and sensor networks.
Rixian Liu received the B.E. degree from Northeast Normal University, China, in 2002, and the Ph.D. degree from Zhejiang University of Technology, China, in 2019. She is currently an associate professor in the College of Information Engineering of Jinhua Polytechnic, China. Her research interests include machine learning, object detection, surface defect detection, etc.