Teacher-student knowledge distillation for real-time correlation tracking
Introduction
Visual tracking is a fundamental problem in computer vision with applications in many fields: given a target specified in the first frame, the tracker must follow it automatically through a changing video sequence. Many methods [1], [2], [3], [4], [35], [54], [55], [56] have been proposed to address difficulties in visual tracking such as occlusion and slow running speed. Recently, correlation filter (CF) based trackers have attracted wide attention because of their computational efficiency in the Fourier domain. CF based trackers generally extract the target feature representation with deep convolutional neural networks taken unchanged from other tasks. Compared with traditional hand-crafted features (e.g., HoG [5]), deep convolutional features represent the target more effectively, and CF based trackers built on them achieve more robust and accurate results on several popular benchmarks [1], [2], [3], [4]. However, this gain in accuracy comes with a serious loss of running speed, especially on resource-constrained platforms. There are two main reasons. (1) The correlation filter step takes longer. Because these deep convolutional features are designed to cover general object categories in large datasets such as ImageNet, they are high-dimensional, and the computation time of a correlation filter grows with the feature dimension. (2) Feature extraction takes longer. Extracting the convolutional features of an image requires a large number of convolution operations, all of which cost time. Furthermore, trackers that use a raw deep convolutional neural network as the feature extractor require a large amount of memory. For example, most CF based trackers [7], [8] use the original VGG-M [6] as the feature extractor; including the fully connected layers, the model size of VGG-M is about 369 MB.
Although GPUs can be used to accelerate trackers to some extent, relying on them severely limits the practical application scope. In this work, we explore how to speed up CF based trackers that use a raw deep convolutional neural network. Our goal is to make the improved CF tracker run on a single CPU platform without significantly reducing its performance, thus widening the application scope of CF based trackers. According to our observations, the running speed can be improved in two ways:
(1) Reducing the model capacity of the feature extraction network. A smaller-capacity deep convolutional neural network reduces the time spent on target feature extraction and also the memory occupied by the algorithm;
(2) Reducing the dimension of the extracted target features, which reduces the computation time of the correlation filter.
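Point (2) can be illustrated with a minimal NumPy sketch of a generic multi-channel discriminative correlation filter solved in closed form in the Fourier domain. This is not the ECO formulation used later in the paper; the function names, the regularization value and the Gaussian label are assumptions made for illustration only. The point to note is that both training and detection perform one 2-D FFT per feature channel, so the cost scales with the feature dimension C.

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    """Desired response: a Gaussian peak at `center` (hypothetical helper)."""
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

def train_filter(features, label, reg=1e-2):
    """Closed-form ridge-regression filter for a (C, H, W) feature tensor."""
    X = np.fft.fft2(features, axes=(-2, -1))   # one 2-D FFT per channel
    Y = np.fft.fft2(label)
    energy = np.sum(np.abs(X) ** 2, axis=0)    # spectral energy summed over channels
    return np.conj(X) * Y / (energy + reg)     # filter kept in the Fourier domain

def respond(filt, features):
    """Correlation response of the filter on new (C, H, W) features."""
    Z = np.fft.fft2(features, axes=(-2, -1))   # again one FFT per channel
    return np.real(np.fft.ifft2(np.sum(filt * Z, axis=0)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 16, 16))   # stand-in for extracted features
    filt = train_filter(feats, gaussian_label((16, 16), (5, 7)))
    # Circularly shifting the features moves the response peak by the same offset.
    resp = respond(filt, np.roll(feats, (2, 3), axis=(1, 2)))
    print(np.unravel_index(np.argmax(resp), resp.shape))
```

Halving C in this sketch halves the number of FFTs and element-wise products, which is the effect the lower-dimensional student features are meant to exploit.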
To this end, we introduce a teacher-student knowledge distillation training framework to obtain a lightweight convolutional neural network with lower feature dimension, shorter feature extraction time and a smaller memory footprint. The lightweight model is then used as the feature extractor to speed up CF based trackers. Specifically, we take a deep convolutional neural network pretrained for image classification, namely VGG-M [6], as the teacher network, and design a lightweight convolutional neural network as the student network. In the usual knowledge distillation setting, the student network is obtained by compressing the teacher network and is applied to the same domain as the teacher. In this work, the student and the teacher belong to two different domains: correlation tracking and image classification, respectively. To achieve model compression and reduce the differences between domains, we propose two loss functions to guide the training of the student network: the attention transfer loss (AT loss) and the correlation tracking loss (CT loss). The AT loss ensures that the lightweight student network retains the feature representation of the large-capacity teacher network. The CT loss improves the discriminative ability of the student network and shifts it from the image classification task to the correlation tracking task, narrowing the gap between domains. Meanwhile, to enrich the feature representation of the student network, we carry out the distillation process jointly on shallow, middle and deep convolutional layers.
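The exact loss definitions are not given in this excerpt, so the following is only a hedged sketch of how the two losses could be combined over three layer pairs. It assumes an activation-based attention map (channel-wise sum of squared activations, L2-normalized), which lets a student with fewer channels be compared with the teacher as long as the spatial sizes match, and it assumes a CT loss defined as the squared error between a closed-form correlation filter response and a desired label. All function names and the `ct_weight` parameter are hypothetical. The code evaluates the losses only; actual training would need an automatic differentiation framework.

```python
import numpy as np

def attention_map(feat, p=2):
    """Collapse (C, H, W) features into a unit-norm spatial attention vector."""
    amap = np.sum(np.abs(feat) ** p, axis=0).ravel()
    return amap / (np.linalg.norm(amap) + 1e-12)

def at_loss(student_feat, teacher_feat):
    """Assumed AT loss: squared distance between normalized attention maps."""
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return float(np.sum(diff ** 2))

def ct_loss(template_feat, search_feat, label, reg=1e-2):
    """Assumed CT loss: error of a closed-form correlation filter response."""
    X = np.fft.fft2(template_feat, axes=(-2, -1))
    Z = np.fft.fft2(search_feat, axes=(-2, -1))
    Y = np.fft.fft2(label)
    filt = np.conj(X) * Y / (np.sum(np.abs(X) ** 2, axis=0) + reg)
    response = np.real(np.fft.ifft2(np.sum(filt * Z, axis=0)))
    return float(np.mean((response - label) ** 2))

def distillation_loss(student_layers, teacher_layers, search_layers, labels,
                      ct_weight=1.0):
    """Sum both losses over the (shallow, middle, deep) layer pairs."""
    total = 0.0
    for s, t, z, y in zip(student_layers, teacher_layers, search_layers, labels):
        total += at_loss(s, t) + ct_weight * ct_loss(s, z, y)
    return total
```

Because the attention map sums over channels before comparison, the student's channel count can differ from the teacher's; only the spatial resolution of each paired layer has to agree under this assumption.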
After offline training with the teacher-student knowledge distillation framework, we obtain a lightweight feature extraction network with a model size of about 1.3 MB. Compared with the teacher network size of 90 MB (excluding all fully connected layers), the student network is about 69 times smaller. When the trained lightweight student network is combined with a state-of-the-art correlation filter based tracker, namely ECO [7], the tracker achieves real-time running speed (26 FPS) on a CPU platform. Meanwhile, extensive experiments on popular benchmarks show that the proposed method maintains performance close to that of the original ECO.
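As a quick check of the sizes quoted in the text, the stated compression ratio follows directly from the two model sizes. The second ratio, relative to the full 369 MB VGG-M including its fully connected layers, is derived here from the figures given in the introduction and is not a number reported by the source.

```latex
\frac{90\ \text{MB}}{1.3\ \text{MB}} \approx 69.2,
\qquad
\frac{369\ \text{MB}}{1.3\ \text{MB}} \approx 283.8
```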
We summarize our main contributions as follows:
(1) A new teacher-student knowledge distillation training framework is proposed to learn a lightweight network for CF based visual trackers. To train the lightweight student network, we propose an attention transfer loss function and a correlation tracking loss function that jointly guide the training process.
(2) We propose to distill the lightweight student network through the attention transfer process and the correlation tracking process jointly on shallow, middle and deep convolutional layers, to enrich the feature representation of the student network.
(3) We combine the learned lightweight student network with a state-of-the-art CF based tracker [7]. The evaluation on four popular benchmarks shows that our method improves the running speed of the tracker on a CPU while maintaining similar tracking performance.
Section snippets
Related works
In this section, we briefly review the work most closely related to ours in three areas: correlation filters for visual tracking, real-time visual tracking based on deep learning, and knowledge distillation.
Proposed methods
The framework of the proposed teacher-student knowledge distillation is given in Fig. 1. In the following sub-sections, we introduce its network structure, the attention transfer training process, the correlation tracking training process, and the online correlation filter tracking process with the learned lightweight network.
Experiments
In this section, we first introduce the implementation details. Second, results on OTB2013 [1], OTB2015 [2], VOT2017 [3] and Temple Color [4] demonstrate the effectiveness and robustness of our method. Finally, we conduct ablation experiments to analyze how each part of the tracker contributes to its performance and to assess the effectiveness of the network structure.
Conclusion
In this work, we propose to use a lightweight feature extraction network to speed up CF based trackers by reducing both the feature extraction time and the correlation filter learning time. A highly compressed, lightweight feature extraction network is obtained by compressing a raw large-capacity teacher network and transferring it from the image classification task. Extensive experiments show that our training strategy is effective. Although the obtained network is very
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the Project of Guangxi Science and Technology (No. 2022GXNSFDA035079 and GuiKeAD21075030), the National Natural Science Foundation of China (No. 61972167 and 62076214), the Guangxi “Bagui Scholar” Teams for Innovation and Research Project, the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and the Guangxi Talent Highland Project of Big Data Intelligence and Application.
Qihuang Chen is currently a visiting researcher at Guangxi Normal University, Guilin, China. He received the M.S. degree from School of Computer Science and Technology, Huaqiao University, in 2020. His research interests include computer vision and machine learning.
References (56)
- Online object tracking: A benchmark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
- Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- The visual object tracking VOT2017 challenge results
- Encoding Color Information for Visual Tracking: Algorithms and Benchmark. IEEE Trans. Image Process. (2015)
- Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- K. Chatfield, K. Simonyan, A. Vedaldi, et al. Return of the devil in the details: Delving deep into convolutional nets,...
- M. Danelljan, G. Bhat, F. Shahbaz Khan, et al., Eco: Efficient convolution operators for tracking, in: Proceedings of...
- Convolutional Features for Correlation Filter Based Visual Tracking
- Visual object tracking using adaptive correlation filters. Twenty-third IEEE Conference on Computer Vision & Pattern Recognition, IEEE (2010)
- Y. Bo, Z.Q. Ling, Optimal Control for Large-scale Descriptor Systems with Symmetric Circulant Structure, J....
- High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell.
- Adaptive color attributes for real-time visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Hierarchical Convolutional Features for Visual Tracking
- A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration
- Learning spatially regularized correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision
- Context-aware correlation filter tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Multi-kernel correlation filter for visual tracking. Proceedings of the IEEE International Conference on Computer Vision
- Attentional correlation filter network for adaptive visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Reliable patch trackers: Robust visual tracking by exploiting reliable patches. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Real-time part-based visual tracking via adaptive correlation filters. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Real-time visual tracking: Promoting the robustness of correlation filter learning. European Conference on Computer Vision
- Beyond correlation filters: Learning continuous convolution operators for visual tracking
- Learning multi-domain convolutional neural networks for visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Crest: Convolutional residual learning for visual tracking. Proceedings of the IEEE International Conference on Computer Vision
- High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Cited by (5)
SiamDF: Tracking training data-free siamese tracker
2023, Neural Networks
Object Knowledge Distillation for Joint Detection and Tracking in Satellite Videos
2024, IEEE Transactions on Geoscience and Remote Sensing
SiamOHOT: A Lightweight Dual Siamese Network for Onboard Hyperspectral Object Tracking via Joint Spatial-Spectral Knowledge Distillation
2023, IEEE Transactions on Geoscience and Remote Sensing
Regress 3D human pose from 2D skeleton with kinematics knowledge
2023, Electronic Research Archive
Bineng Zhong received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2004, 2006, and 2010, respectively. From 2007 to 2008, he was a Research Fellow with the Institute of Automation and Institute of Computing Technology, Chinese Academy of Sciences. From September 2017 to September 2018, he was a visiting scholar at Northeastern University, Boston, MA, USA. From November 2010 to October 2020, he was a professor with the School of Computer Science and Technology, Huaqiao University, Xiamen, China. Currently, he is a professor with the School of Computer Science and Engineering, Guangxi Normal University, Guilin, China. His current research interests include pattern recognition, machine learning, and computer vision.
Qihua Liang received the B.S. degree in accounting from Xiamen University, Xiamen, China, in 2014. Currently, she is a teacher with the School of Computer Science and Engineering, Guangxi Normal University, Guilin, China. Her current research interests include computer vision and pattern recognition.
Deng Qingyong is an associate professor at the School of Computer Science and Engineering & School of Software, Guangxi Normal University, China. He received his master’s degree in Signal and Information Processing from Xiangtan University, China, in 2009 and his Ph.D. degree from Beijing University of Posts and Telecommunications (BUPT), China, in 2019. He has published more than 30 refereed journal papers in his current research areas, which include IoT, AI and wireless networks. He is a member of IEEE and CCF.
Xianxian Li received the Ph.D. degree in computer science and technology from Beihang University, Beijing, China. He is currently a professor with the School of Computer Science and Engineering, Guangxi Normal University. His research interests include machine learning, data security, blockchain and distributed systems.