Deep discriminative correlation tracking based on adaptive feature fusion
Introduction
As one of the fundamental problems in computer vision, visual tracking has been widely applied in many fields, such as visual surveillance, traffic monitoring, and human-computer interfaces. During the past decade, many trackers have been proposed to improve tracking performance; they are covered in recent surveys of visual tracking. Although these traditional methods have made much progress, building a robust tracker remains a tremendous challenge because of severe occlusion and serious appearance changes of the target.
Recently, with the rapid development of deep learning, deep features based on Convolutional Neural Networks (CNNs) have demonstrated outstanding performance in computer vision applications, e.g., object recognition [1], [2], image classification [3], [4], and saliency detection [5], [6]. A deep CNN has a strong capability to learn rich high-level semantic feature representations, which are of great significance in distinguishing objects of different categories. Some recent studies [7], [8], [9] have shown that a multilayer CNN architecture can efficiently capture sophisticated hierarchical features whose layers have different properties for the tracking problem. The higher layers capture more abstract, semantic features, which are effective at distinguishing objects of different categories and robust to dramatic changes in target appearance. However, if background objects look similar to the target, the high-level features become less effective at differentiating them. The lower layers provide more detailed local features. They are less robust to changes in target appearance but, because of their detailed representations, are very helpful for separating the true target from similar background objects. This phenomenon has motivated some researchers to apply hierarchical convolutional features to improve tracking precision [8], [9].
Besides deep learning tracking methods, the Correlation Filter (CF) [10], [11] has proved to be efficient and effective for the visual tracking problem. As one of the traditional tracking methods, CF-based tracking has attracted considerable attention for its high computational efficiency, which comes from the use of fast Fourier transforms. Earlier evaluations demonstrate that trackers based on discriminative correlation filters (DCF) perform relatively better than other traditional trackers. Since deep learning based tracking became prevalent, the integration of deep learning and discriminative correlation filters has achieved state-of-the-art performance compared with other deep learning tracking methods. The CNN provides stable image features, while the discriminative correlation filters serve as the discriminative classifier that produces the tracking results. HCF is a typical DCF-based tracker that exploits features extracted from deep convolutional neural networks trained on object recognition datasets to improve tracking accuracy and robustness.
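The efficiency argument can be made concrete with a toy example. The following is a minimal sketch, not the authors' implementation: it trains a single-channel, one-dimensional correlation filter in closed form in the Fourier domain, in the style of the adaptive correlation filter (MOSSE) formulation, and uses it to recover a known circular shift. The naive `dft`/`idft` helpers stand in for a real FFT, and all names (`train_filter`, `respond`, `lam`) and the toy signal are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform; a stand-in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def train_filter(x, y, lam=1e-2):
    """Closed-form filter per frequency: conj(H) = Y * conj(X) / (|X|^2 + lam).

    x is the training sample, y the desired response, and lam a hypothetical
    regularization weight that keeps the division well conditioned.
    """
    X, Y = dft(x), dft(y)
    return [Yk * Xk.conjugate() / (abs(Xk) ** 2 + lam) for Xk, Yk in zip(X, Y)]


def respond(h_conj, z):
    """Correlation response for a new sample z: one product per frequency."""
    Z = dft(z)
    return [v.real for v in idft([Hk * Zk for Hk, Zk in zip(h_conj, Z)])]


# Toy data (assumed): a short pattern and a Gaussian label peaked at `center`.
n, center, sigma = 16, 8, 1.5
x = [0.0] * n
x[4], x[5], x[6] = 1.0, 2.0, 0.5
y = [math.exp(-((t - center) ** 2) / (2 * sigma ** 2)) for t in range(n)]
h_conj = train_filter(x, y)

# Shift the pattern circularly; the response peak moves by the same amount.
true_shift = 3
z = [x[(t - true_shift) % n] for t in range(n)]
response = respond(h_conj, z)
estimated_shift = response.index(max(response)) - center
print(estimated_shift)
```

Training and detection each reduce to element-wise operations on the transformed signals, which is why replacing the naive transform with an FFT makes CF-based trackers fast in practice.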
Although the integration of CNN and DCF improves the performance of visual tracking significantly, these deep learning based trackers still face some challenging problems. Firstly, manually presetting the weights used to fuse the features of different CNN layers is usually not the optimal choice. The weight values should adapt to different video sequences and even to different frames in the same sequence. Secondly, existing works show a large performance gap between success plots and precision plots. The main reason is that scale estimation is very challenging during tracking. Thirdly, the target is easily lost once serious occlusion occurs, because some traditional strategies update the model constantly. Constant updating tends to introduce background clutter into the positive samples, which results in error accumulation until the target is lost.
To address the above problems, we designed an adaptive feature fusion strategy to fuse the correlation response maps generated from different CNN layer features. Based on this, we proposed a deep discriminative correlation tracking algorithm with scale adaptation and model update. The main contributions of our work can be summarized as follows:
- (1) We designed a metric to measure the uncertainty of the correlation response map and proposed an adaptive feature fusion strategy to integrate the hierarchical CNN features with the discriminative correlation filter. The integrated tracker can adjust the fusion weights online to obtain more robust tracking results than the baseline trackers.
- (2) We constructed a scale filter to estimate the scale of the target based on a set of scale correlation filters. With the scale filter, the varying scale of the target is estimated accurately after its center has been located. This design relieves the mutual influence of location errors and scale errors, and reduces computational complexity.
- (3) We proposed a new adaptive and selective updating mechanism to relieve model drift. It provides two model updating strategies and adaptively selects the better one according to the correlation value of two adjacent frames. This updating mechanism further improves the tracking success rate under some typical challenging conditions.
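Contribution (1) can be illustrated with a small sketch. The uncertainty metric itself is not specified in this section, so the example below substitutes the peak-to-sidelobe ratio (PSR), a common confidence measure for correlation responses, and normalizes the PSR scores into fusion weights. The function names, the 5x5 toy maps, and the choice of PSR are all assumptions for illustration, not the proposed method.

```python
import math


def peak_location(response):
    """Return (row, col) of the largest value in a 2-D response map."""
    rows, cols = len(response), len(response[0])
    return max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: response[rc[0]][rc[1]])


def psr(response, exclude=1):
    """Peak-to-sidelobe ratio: (peak - sidelobe mean) / sidelobe std.

    The sidelobe is every cell farther than `exclude` from the peak.
    PSR is an assumed stand-in for the paper's uncertainty metric.
    """
    pr, pc = peak_location(response)
    side = [v for r, row in enumerate(response) for c, v in enumerate(row)
            if abs(r - pr) > exclude or abs(c - pc) > exclude]
    mean = sum(side) / len(side)
    std = math.sqrt(sum((v - mean) ** 2 for v in side) / len(side))
    return (response[pr][pc] - mean) / (std + 1e-12)


def fuse(responses):
    """Weight each layer's response map by its normalized confidence."""
    scores = [max(psr(r), 0.0) for r in responses]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        # No map is confident: fall back to uniform weights.
        weights = [1.0 / len(responses)] * len(responses)
    rows, cols = len(responses[0]), len(responses[0][0])
    fused = [[sum(w * r[i][j] for w, r in zip(weights, responses))
              for j in range(cols)] for i in range(rows)]
    return fused, weights


# Toy response maps (assumed): one sharp peak, one ambiguous map with a distractor.
sharp = [[0.05 * ((r + c) % 3) for c in range(5)] for r in range(5)]
sharp[2][3] = 1.0
ambiguous = [[0.3 + 0.1 * ((r * c) % 3) for c in range(5)] for r in range(5)]
ambiguous[1][1] = 0.6
ambiguous[4][4] = 0.55

fused, weights = fuse([sharp, ambiguous])
print(weights, peak_location(fused))
```

Because the weights are recomputed from the response maps of each frame, a layer whose response becomes ambiguous contributes less to the fused map in that frame, which is the behavior that fixed, manually preset weights cannot provide.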
Extensive experiments are carried out on the OTB2013 [12] (50 challenging videos), OTB2015 [13] (100 challenging videos) and Temple Color 128 [14] (128 challenging videos, TC128 for short) tracking benchmark datasets. The experimental results demonstrate that the proposed tracker achieves outstanding performance against state-of-the-art trackers (most of them proposed between 2015 and 2017).
Section snippets
Related work
As a very active topic in computer vision, visual tracking has been researched for decades. Many tracking algorithms have been proposed to resolve the hard problems and improve overall performance. Especially in recent years, deep learning based trackers have attracted very broad attention and improved performance significantly. However, a comprehensive survey of the visual tracking literature is beyond the scope of this paper. We only briefly review the closely related works according to
Proposed algorithm
Recent advances in visual tracking demonstrate that CNN-based trackers achieve relatively higher performance and CF-based trackers offer remarkable efficiency. The integration of the two is highly attractive to researchers. However, how to exploit the potential of CNN features as fully as possible is a tough question. In this section, we introduce the proposed algorithm, which consists of deep discriminative correlation filters learning, feature fusion based on
Experiments
To evaluate the proposed algorithm, we implement the tracker using mixed MATLAB and VC++ programming on an experimental platform with a CPU (Intel Xeon 2.4 GHz) and a GPU (GTX Titan X). We evaluate the proposed tracker on some recent visual tracking benchmark datasets, comparing it with some state-of-the-art trackers under one-pass evaluation (OPE). These trackers can be broadly categorized into four classes: (i) CNN-based trackers including CNN-SVM [16] and STCT [17]; (ii)
Discussion
In this paper, by integrating deep discriminative correlation filters learning with a novel fusion strategy, we propose a deep discriminative correlation tracking algorithm based on adaptive feature fusion. To further improve the tracking performance, we design an online fast scale estimation method. Moreover, we present a new adaptive and selective update mechanism to update both the discriminative correlation filters and the scale correlation filters. The new update mechanism solves
Declaration of Competing Interest
None.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under grants #61703423, #61773396, and #41601436.
Wangsheng Yu was born in Hunan province, China. He received his M.S. and Ph.D. degrees, both in Communication and Information System, from the Air Force Engineering University (AFEU) in 2010 and 2014, respectively. He is currently a lecturer with the Information and Navigation College, Air Force Engineering University. His research interests include computer vision and image processing.
References (49)
- et al. Deep neural network for halftone image classification based on sparse auto-encoder. Eng. Appl. Artif. Intell. (2016)
- et al. Robust visual tracking via co-trained kernelized correlation filters. Pattern Recog. (2017)
- et al. Deep neural networks for object detection. Proceedings of the NIPS, Lake Tahoe (2013)
- et al. Multimodal deep learning for robust RGB-D object recognition. Proceedings of the IROS, Hamburg (2015)
- et al. CNN-RNN: a unified framework for multi-label image classification. Proceedings of the CVPR, Las Vegas (2016)
- et al. Saliency detection with recurrent fully convolutional networks. Proceedings of the ECCV (2016)
- et al. Visual saliency detection based on multiscale deep CNN features. IEEE Trans. Image Process. (2016)
- et al. Hypercolumns for object segmentation and fine-grained localization. Proceedings of the CVPR (2015)
- et al. Hierarchical convolutional features for visual tracking. Proceedings of the ICCV, Santiago (2015)
- et al. Analyzing the performance of multilayer neural networks for object recognition. Proceedings of the ECCV (2014)
- Visual object tracking using adaptive correlation filters. Proceedings of the CVPR
- Correlation filters for object alignment. Proceedings of the CVPR
- Online object tracking: a benchmark. Proceedings of the CVPR
- Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell.
- Encoding color information for visual tracking: algorithms and benchmark. IEEE Trans. Image Process.