
Neurocomputing

Volume 436, 14 May 2021, Pages 260-272

CSART: Channel and spatial attention-guided residual learning for real-time object tracking

https://doi.org/10.1016/j.neucom.2020.11.046

Abstract

Siamese networks have achieved great success in object tracking owing to their balance of precision and speed. However, Siamese trackers usually utilize only the local features of the last layer, which may degrade tracking performance in difficult scenarios. In this paper, we propose a novel Channel and Spatial Attention-guided Residual learning framework for Tracking, referred to as CSART, which improves the feature representation of Siamese networks by exploiting the self-attention mechanism to capture powerful contextual information. Specifically, for efficient and seamless integration, different kinds of self-attention are appended to the template and search branches of the Siamese network, modeling global semantic inter-dependencies in the channel and spatial dimensions. To avoid representation degradation, we adaptively aggregate the base feature and its attention-weighted features with residual learning. Furthermore, a joint loss consisting of the classic logistic loss and a focal softmax loss is designed to emphasize difficult samples and guide the learning of the whole model. Benefiting from this scheme, CSART alleviates over-fitting to some extent and enhances discriminability. Extensive experiments on six popular tracking datasets indicate that the proposed tracker outperforms other state-of-the-art trackers while running at 65 fps.

Introduction

Visual object tracking is one of the most challenging tasks in computer vision. The task is to locate the object in subsequent frames, given a target annotated by a bounding box in the first frame. It has a wide range of vision applications, such as video surveillance [1], motion analysis [2] and human-computer interaction [3]. Though recent deep tracking methods [4], [5], [6], [7] have made great progress, many challenging factors remain, including background clutter, motion blur, occlusion, fast motion, deformation and scale variation. In particular, practical applications require robust trackers with good accuracy and real-time speed.

In recent years, trackers based on Siamese networks [8], [9], [6], [10], [11] have attracted great interest in the visual tracking community because they balance performance and speed well. In particular, SiamFC [8] views tracking as a template matching problem in a certain embedding space and builds a fully convolutional Siamese network to learn the similarity. As a follow-up work, SA-Siam [9] designs a twofold Siamese network consisting of an appearance branch and a semantic branch that exploits heterogeneous features to improve SiamFC. SiamRPN [6] embeds a region proposal network into the Siamese network, formulating the task as a one-shot local detection problem. In the same period, RASNet [10] introduces three types of attention mechanisms over the feature maps of the template branch to weight the cross-correlation operation, thereby decoupling feature representation from similarity learning to relieve over-fitting. SiamDW [11] and SiamRPN++ [12] eliminate the impact of the padding operation in deeper networks via randomly translated sampling and an added crop operation, respectively. As a result, popular and powerful networks such as ResNet [13] have been successfully used as backbones of Siamese networks, which greatly improves performance.
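The cross-correlation at the heart of SiamFC-style trackers can be sketched as a sliding-window inner product between the template feature and the search feature. The following is a minimal NumPy illustration of that similarity map, not the batched convolutional implementation used in practice:

```python
import numpy as np

def cross_correlation(z, x):
    """Slide the template feature z over the search feature x and
    return a response map of similarity scores (SiamFC-style).

    z: template feature, shape (C, hz, wz)
    x: search feature,  shape (C, hx, wx), with hx >= hz, wx >= wz
    """
    C, hz, wz = z.shape
    _, hx, wx = x.shape
    out = np.empty((hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product between the template and one search window
            out[i, j] = np.sum(z * x[:, i:i + hz, j:j + wz])
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3, 3))   # toy template embedding
x = rng.standard_normal((4, 8, 8))   # toy search embedding
response = cross_correlation(z, x)
print(response.shape)  # (6, 6)
```

The peak of the response map gives the most likely target location; in real trackers the same operation is implemented as a grouped convolution for speed.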

Despite these significant successes, most Siamese trackers can only discriminate the target from non-semantic background, because they employ the features of the last convolutional layer of the template branch to represent the target uniformly. Consequently, performance cannot be guaranteed in challenging scenes such as deformation, background clutter and rotation. Since the networks learn general features, the learned features usually cannot adapt well to an arbitrary object during tracking. Some trackers [5], [14] update the deep features online, which leads to expensive computation and tends to over-fit. Moreover, most existing algorithms focus on the local semantic features of the object but ignore its global contextual information, which limits their ability to capture robust features. Motivated by these considerations, we develop an effective and efficient Siamese network with several kinds of residual self-attention modules for high-performance visual tracking. Our intention is to improve feature representation via the self-attention mechanism: focusing on meaningful, interdependent features and suppressing redundant, unimportant ones for object tracking.

In this paper, we design an end-to-end architecture named CSART to learn powerful feature representations and adaptive residual attention weights for visual object tracking. For the template branch, the self-attention module selectively emphasizes interdependent channel-wise features via channel attention and captures strong contextual information with spatial attention. For the search branch, criss-cross attention learns global spatial inter-dependencies, while cross-attention activates the corresponding channels. These attention features are then integrated with the base feature via residual learning for adaptive feature enhancement, which is beneficial for adapting the offline-trained feature representation to an arbitrary tracking target. Moreover, we introduce a novel joint loss function to emphasize hard negative samples and guide the learning process of our model. To guarantee real-time tracking, all of these learning processes are carried out in the training phase. Since the self-attention is computed only in the first frame and the criss-cross attention is sufficiently light-weight, the extra parameters and computation are negligible in most cases.
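As a rough illustration of the residual aggregation of a base feature with its attention-weighted versions, the following sketch derives toy channel and spatial attention weights from pooled statistics. The learned projections of the actual modules are omitted, so this shows only the data flow, not the paper's architecture:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def residual_attention(feat):
    """Toy sketch of residual attention aggregation, F' = F + A(F) * F.

    feat: feature map of shape (C, H, W). Attention weights are derived
    directly from pooled statistics instead of learned layers.
    """
    # channel attention: one weight per channel from global average pooling
    chan_w = sigmoid(feat.mean(axis=(1, 2)))[:, None, None]  # (C, 1, 1)
    # spatial attention: one weight per position from the channel mean
    spat_w = sigmoid(feat.mean(axis=0))[None, :, :]          # (1, H, W)
    # residual aggregation keeps the base feature intact
    return feat + chan_w * feat + spat_w * feat

feat = np.random.default_rng(1).standard_normal((8, 6, 6))
out = residual_attention(feat)
print(out.shape)  # (8, 6, 6)
```

Because the attention terms are added to, rather than substituted for, the base feature, the module degrades gracefully: even uninformative attention weights leave the original representation available, which is the intuition behind using residual learning to avoid representation degradation.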

We take SiamFC [8] as our baseline, embed three types of self-attention modules into the Siamese network to train the architecture, and apply the same online tracking mechanism. Fig. 1 shows that CSART achieves a more powerful representation and better tracking accuracy. Extensive experiments and analyses on several benchmarks verify the effectiveness of our method.

To summarize, our main contributions are four-fold.

  • An end-to-end deep architecture especially designed for the offline training of Siamese trackers is proposed, which inherits the merits of deep networks and the self-attention mechanism to improve the capacity of feature representation for high-performance tracking without any fine-tuning.

  • We integrate the base feature and the features weighted by context-aware self-attention, spanning the spatial and channel dimensions, via residual learning to enhance the discriminative representation of both branches, which alleviates over-fitting to some extent.

  • To address the imbalance between easy and hard samples, we design a joint loss function combining the classic logistic loss and a focal softmax loss to stress hard instances and promote the effective training of our network.

  • Comprehensive experiments on several representative tracking benchmarks demonstrate that the proposed CSART achieves state-of-the-art performance at a speed far beyond real-time.
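The joint loss of the third contribution can be sketched as a logistic loss over the response map plus a focal softmax term that down-weights easy samples. The weighting λ and focusing parameter γ below are illustrative hyperparameters, not the paper's values:

```python
import numpy as np

def logistic_loss(scores, labels):
    """Classic SiamFC-style logistic loss over a response map.
    scores: real-valued map; labels: +1 / -1 map of the same shape."""
    return np.mean(np.log1p(np.exp(-labels * scores)))

def focal_softmax_loss(logits, target, gamma=2.0):
    """Focal softmax loss: scales cross-entropy by (1 - p_t)^gamma,
    so confident (easy) samples contribute little.
    logits: (N, K) class scores; target: (N,) integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pt = p[np.arange(len(target)), target]
    return np.mean(-((1.0 - pt) ** gamma) * np.log(pt))

def joint_loss(scores, labels, logits, target, lam=0.5, gamma=2.0):
    """Joint objective: logistic loss + lam * focal softmax loss."""
    return logistic_loss(scores, labels) + lam * focal_softmax_loss(logits, target, gamma)

confident = np.array([[5.0, 0.0], [0.0, 5.0]])   # easy samples
uncertain = np.array([[0.1, 0.0], [0.0, 0.1]])   # hard samples
target = np.array([0, 1])
print(focal_softmax_loss(confident, target) < focal_softmax_loss(uncertain, target))  # True
```

The final comparison shows the focal term's effect: correctly and confidently classified (easy) samples are almost fully suppressed, so gradients are dominated by hard instances.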

Section snippets

Related work

Visual representation is important for object tracking. Existing deep trackers [5], [16], [17], [18] mainly exploit pre-trained networks [19], [13] and have achieved great success in the visual tracking community. In this section, we give a brief review of Siamese network based trackers and attention mechanisms in computer vision, and state the main differences between our method and others.

The proposed tracking framework

In this section, we first introduce an effective and efficient tracking framework of an attention-guided Siamese network, which can be trained in an end-to-end manner. Fig. 2 illustrates the pipeline of the proposed framework. Then three self-attention modules, namely spatial residual attention, channel residual attention and criss-cross attention, are introduced to capture contextual information in the spatial and channel dimensions. Different from previous tracking architectures, CSART reformulates
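The criss-cross attention mentioned above restricts each position's attention to its own row and column, reducing the cost per position from O(H·W) for full non-local attention to O(H+W). A toy NumPy sketch of this sparse pattern follows; the query/key/value projections of the original CCNet-style module are omitted, and the center position is counted twice for simplicity:

```python
import numpy as np

def criss_cross_attention(feat):
    """Toy criss-cross attention: each position aggregates features only
    from its own row and column rather than the whole H x W plane.

    feat: feature map of shape (C, H, W). Affinities are plain inner
    products, so this illustrates only the attention pattern.
    """
    C, H, W = feat.shape
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            q = feat[:, i, j]  # query vector, shape (C,)
            # keys/values on the criss-cross path: row i and column j
            keys = np.concatenate([feat[:, i, :], feat[:, :, j]], axis=1)  # (C, W+H)
            logits = q @ keys
            att = np.exp(logits - logits.max())
            att /= att.sum()                  # softmax affinities over the path
            out[:, i, j] = keys @ att         # convex combination of path features
    return out

feat = np.random.default_rng(2).standard_normal((3, 5, 7))
out = criss_cross_attention(feat)
print(out.shape)  # (3, 5, 7)
```

Because each output is a convex combination of features on the criss-cross path, stacking two such passes lets information propagate between any pair of positions, which is how this sparse pattern still captures global spatial dependencies.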

Experiments

In this section, we first introduce the implementation details of the network and parameters. Then, our tracker is evaluated on several popular tracking benchmarks, including OTB-2013 [44], OTB-2015 [15], VOT-2016 [45], VOT-2017 [46], UAV123 [47] and LaSOT [48], and compared with other Siamese trackers and state-of-the-art tracking algorithms. In addition, we conduct an ablation study to analyze the effectiveness of each component. The proposed method is implemented in Python

Conclusion and future work

In this work, we propose a self-attention guided deep Siamese network for visual tracking, which is especially designed for fast online tracking. We incorporate diverse self-attention features into Siamese networks by effective residual learning. The channel residual self-attention selectively emphasizes associated channels and reflects the channel-wise quality of features, while the spatial residual self-attention adaptively aggregates the information at every pixel by capturing the weighted sum

CRediT authorship contribution statement

Dawei Zhang: Conceptualization, Methodology, Visualization, Writing - original draft. Zhonglong Zheng: Validation, Writing - review & editing. Minglu Li: Writing - review & editing. Rixian Liu: Data curation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants No. 61672467 and No. 61827810, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LGG18F020017.


References (66)

  • Y. Liu et al., Multiple people tracking with articulation detection and stitching strategy, Neurocomputing (2020)
  • V. Gajjar et al., Human detection and tracking for video surveillance: a cognitive science approach, in
  • X. Wang et al., Learning correspondence from the cycle-consistency of time, in
  • J. Singha et al., Dynamic hand gesture recognition using vision-based approach for human-computer interaction, Neural Comput. Appl. (2018)
  • M. Danelljan, G. Bhat, F. Shahbaz Khan, M. Felsberg, Efficient convolution operators for tracking, in: The IEEE...
  • H. Nam et al., Learning multi-domain convolutional neural networks for visual tracking, in
  • B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in: The...
  • M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, Atom: accurate tracking by overlap maximization, in: The IEEE Conference...
  • L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional siamese networks for object...
  • A. He et al., A twofold siamese network for real-time object tracking, in
  • Q. Wang, Z. Teng, J. Xing, W. Hu, S. Maybank, Learning attentions: Residual attentional siamese network for high...
  • Z. Zhang et al., Deeper and wider siamese networks for real-time visual tracking, in
  • B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, Siamrpn++: Evolution of siamese visual tracking with very deep...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: The IEEE Conference on Computer...
  • Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, M.-H. Yang, Vital: Visual tracking via adversarial...
  • Y. Wu et al., Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • J. Shen et al., Fast online tracking with detection refinement, IEEE Trans. Intell. Transp. Syst. (2017)
  • H. Hu et al., Robust object tracking using manifold regularized convolutional neural networks, IEEE Trans. Multimedia (2018)
  • A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, in: Neural...
  • R. Tao, E. Gavves, A.W. Smeulders, Siamese instance search for tracking, in: The IEEE Conference on Computer Vision and...
  • Q. Guo, F. Wei, C. Zhou, H. Rui, W. Song, Learning dynamic siamese network for visual object tracking, in:...
  • X. Dong, J. Shen, Triplet loss in siamese network for object tracking, in: The European Conference on Computer Vision...
  • X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, F. Porikli, Hyperparameter optimization for tracking with continuous deep...
  • X. Dong et al., Quadruplet network with one-shot learning for fast visual object tracking, IEEE Trans. Image Process. (2019)
  • X. Li, C. Ma, B. Wu, Z. He, M.-H. Yang, Target-aware deep tracking, in: IEEE Conference on Computer Vision and Pattern...
  • P. Li et al., Gradnet: Gradient-guided network for visual object tracking, in
  • D. Zhang et al., Reinforced similarity learning: Siamese relation networks for robust object tracking, in
  • D. Zhang et al., Joint representation learning with deep quadruplet network for real-time visual tracking
  • X. Dong, J. Shen, W. Wang, L. Shao, H. Ling, F. Porikli, Dynamical hyperparameter optimization via deep reinforcement...
  • D. Zhang et al., High performance visual tracking with siamese actor-critic network
  • J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern...
  • F. Wang et al., Residual attention network for image classification, in
  • X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks,...

    Dawei Zhang received the B.E. degree from Huaiyin Institute of Technology, China, in 2017. He is working toward the Ph.D. degree in college of mathematics and computer science of Zhejiang Normal University, China. His research interests cover deep learning, reinforcement learning and computer vision.

    Zhonglong Zheng received the B.E. degree from University of Petroleum, China in 1999, and the Ph.D. degree from Shanghai Jiaotong University, China in 2005. He is currently a full professor in college of mathematics and computer science of Zhejiang Normal University, China. His research interests include machine learning, computer vision and blockchain. He is the corresponding author of this article.

    Minglu Li graduated from the School of Electronic Technology at The PLA Information Engineering University in 1985, and received the Ph.D. degree in computer software from Shanghai Jiao Tong University in 1996. He was a tenured full professor and the director of Grid Computing Center of Shanghai Jiao Tong University. Currently, he is a full professor in college of mathematics and computer science of Zhejiang Normal University, China. His research interests include grid computing, services computing, and sensor networks.

    Rixian Liu received the B.E. degree from Northeast Normal University, China in 2002, and the Ph.D. degree from Zhejiang University of Technology, China in 2019. She is currently an associate professor in college of Information Engineering of Jinhua Polytechnic, China. Her research interests include machine learning, object detection, surface defects detection, etc.
