Learning spatio-temporal context via hierarchical features for visual tracking

https://doi.org/10.1016/j.image.2018.04.010

Highlights

  • An improved spatio-temporal context (STC) based visual tracking algorithm.

  • A mapping neural network is used to acquire a dynamic training confidence map.

  • Hierarchical features are exploited to construct context prior models.

  • A training confidence index is used to guide the updating process.

  • Experiments on both ordinary and aerial tracking benchmarks show excellent results.

Abstract

Spatio-temporal context (STC) based visual tracking algorithms have demonstrated remarkable tracking capabilities in recent years. In this paper, we propose an improved STC method that seamlessly integrates the powerful feature representation and mapping capabilities of convolutional neural networks (CNNs), based on the theory of transfer learning. Firstly, instead of a fixed training confidence map, our tracker uses a dynamic training confidence map, obtained from a mapping neural network fed with transferred CNN features, to better adapt to practical tracking scenes. Secondly, we exploit hierarchical features from both the original image intensity and the transferred CNN features to construct context prior models. To enhance the accuracy and robustness of our tracker, we simultaneously transfer fine-grained and semantic features from deep networks. Thirdly, we adopt a training confidence index (TCI), derived from the dynamic training confidence map, to guide the updating process: it determines whether back-propagation should be conducted in the mapping neural network and whether the STC model should be updated. The dynamic training confidence map also further alleviates the problem of location ambiguity. Comprehensive experimental results show that our tracker is competitive against several state-of-the-art trackers, and in particular against the baseline STC tracker, on the OTB-2015 and UAV123 visual tracking benchmarks.

Introduction

Visual tracking, a basic but significant task in computer vision, has attracted great attention in recent years owing to its tremendous potential applications in intelligent surveillance, autonomous driving, search missions, video analysis for sports, etc. Under the assumption that only the ground-truth information in the first frame is known, the main objective of visual tracking is to predict the target trajectory in the remaining frames of the sequence. The core problem in the above scenes is single object tracking (SOT), to which numerous tracking algorithms [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34]] and several tracking benchmarks [[35], [36], [37], [38], [39], [40]] are devoted. Thus far, existing algorithms still cannot adapt well to tough conditions such as illumination variation, full or partial occlusion, similar objects, scale variation, viewpoint change, out-of-view motion, etc.

To achieve better tracking capability, most trackers rely on improving the performance of models or classifiers [[2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [19], [20], [22], [23], [24], [25], [26], [27], [28], [29], [30], [32], [33]] and on designing robust feature representations [[1], [13], [14], [15], [16], [17], [18], [21], [26], [29], [31], [34], [41]]. In the first research direction, existing state-of-the-art tracking models utilize improved correlation filters, particle filters, support vector machines (SVMs), AdaBoost, etc. J. Henriques et al. [3] exploit the circulant structure and the Fast Fourier Transform to achieve extremely fast tracking, and utilize the kernel trick to form different kinds of kernel classifiers. The DSST tracker [2], proposed by M. Danelljan et al., emphasizes the significance of scale estimation and relies on a correlation-filter-based scale pyramid to obtain the most appropriate scale parameter. S. Hare et al. [7] exploit a kernelized structured-output support vector machine (SVM) in their Struck tracker, and introduce a budgeting mechanism to limit the number of support vectors. Z. Fan et al. [19] build their tracking algorithm on an iterative particle filter, which approximates the true target state more precisely than other particle-filter-based trackers and thereby considerably improves accuracy and robustness.
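The circulant-structure trick of [3] can be sketched in a few lines of numpy: ridge regression over all circular shifts of a patch has a closed-form solution in the Fourier domain, so training and detection reduce to element-wise operations on FFTs. The following is an illustrative linear (non-kernelized) sketch, not the authors' implementation; the patch size, the Gaussian target width and the regularization value are arbitrary choices.

```python
import numpy as np

def train_filter(x, y, lam=1e-2):
    """Correlation filter learned in the Fourier domain: ridge regression over
    all circular shifts of x, solved in closed form thanks to the circulant
    structure (an element-wise division replaces a matrix inverse)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return X.conj() * Y / (X.conj() * X + lam)

def detect(filt_f, z):
    """Response over all circular shifts of the search patch z; the argmax is
    the predicted translation of the target."""
    resp = np.real(np.fft.ifft2(filt_f * np.fft.fft2(z)))
    return np.unravel_index(np.argmax(resp), resp.shape)

h = w = 64
ys, xs = np.mgrid[0:h, 0:w]
# Gaussian regression target, shifted so its peak encodes "zero translation".
y = np.fft.fftshift(np.exp(-(((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / 8.0)))

x = np.random.default_rng(0).random((h, w))  # stand-in for real image features
f = train_filter(x, y)
print(detect(f, x))  # (0, 0): no shift between training and search patch
```

Because every step is an FFT or an element-wise product, both training and detection run in O(n log n) per frame, which is the source of the "extremely fast" speed noted above.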

As for designing robust feature representations, N. Wang et al. [42] thoroughly evaluate their role in dramatically enhancing tracking ability. M. Danelljan et al. [21] extend the color representations of the tracker in [3] from primitive to rich color attributes, and propose an effective feature dimensionality reduction method to achieve real-time tracking. P. Chen et al. [17] propose a new sparse learning model for multi-task feature selection. Recently, deep learning has successfully boosted the development of many fields in computer vision, such as object detection, action recognition and semantic segmentation [[43], [44], [45], [46], [47], [48], [49], [50]]. Without exception, several deep learning based tracking algorithms [[13], [14], [15], [16], [18], [26], [27], [29], [30], [31], [34]] have emerged and achieved state-of-the-art performance. Hence, a wave of research aimed at designing robust deep abstract feature representations is under way.

The success of existing deep trackers is mainly owing to the theory of transfer learning. Most trackers [[13], [14], [15], [26], [27], [29], [30], [31], [34]] use deep convolutional neural networks (CNNs), e.g., the VGG net [45], trained on large-scale image detection or recognition datasets [[51], [52], [53]], as their online feature extractors. Although these deep CNNs require a great deal of supervised training samples, they acquire powerful feature representations in the process. Given the generalization ability of deep CNNs, transferring abstract CNN features from one field, e.g., object detection, to another, e.g., visual tracking, is effective. However, existing CNN based tracking algorithms suffer from the problem of feature selection. Some trackers [27] directly use abstract CNN features from only one layer, i.e., a convolutional layer or a fully connected layer, of the deep CNN. Using features from a single layer may improve tracking performance, but it is not an optimal choice because of the different feature characteristics of lower and higher layers. Exploiting deep abstract features of higher layers, which is common in object detection [[43], [44], [45], [46]], provides semantic information at the classification level. Nevertheless, abstract features in lower layers contain ample fine-grained information, which is effective for localizing the object and building the spatial context. To enrich the abstract features in the tracker, other algorithms [[14], [15], [26]] directly combine convolutional features from different levels, i.e., they simultaneously exploit part of the hierarchical deep features. However, such combinations consider only a few selected layers, which restrains the transfer ability and neglects the remaining useful features. Furthermore, CNNs also possess a unique feature-mapping ability, which can map the abstract features into another required form [[13], [29]]. This technique can better describe the tracked object and continuously adapt to appearance changes caused by harsh tracking environments such as partial or full occlusion, similar objects, illumination variation and so on. In a word, how to exploit abstract CNN features to realize high-performance tracking remains an open problem, and carefully designing the feature-transfer pattern to acquire better discrimination ability is worth considering.
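To make the idea of hierarchical features concrete, the following numpy sketch resamples activations from several hypothetical CNN layers to a common spatial grid and stacks them along the channel axis, so that fine-grained and semantic features coexist in one representation. The layer names, shapes and nearest-neighbour resampling are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def resample(fmap, size):
    """Nearest-neighbour resampling of a (C, H, W) feature map to (C, size, size)."""
    c, h, w = fmap.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return fmap[:, rows[:, None], cols[None, :]]

rng = np.random.default_rng(0)
# Hypothetical activations: early layers are high-resolution and fine-grained,
# late layers are low-resolution and semantic.
layers = {
    "conv1": rng.random((64, 112, 112)),
    "conv3": rng.random((256, 28, 28)),
    "conv5": rng.random((512, 14, 14)),
}

# Bring every layer to a common spatial grid and stack along the channel axis.
hierarchical = np.concatenate([resample(f, 56) for f in layers.values()], axis=0)
print(hierarchical.shape)  # (832, 56, 56)
```

In a real tracker the random arrays would be replaced by forward-pass activations of a pretrained network, and the resampling by bilinear interpolation; only the stacking pattern is the point here.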

In this paper, we present a novel hierarchical-features-based tracker for spatio-temporal context (STC) learning, which enhances tracking performance by constructing a more robust model and designing more useful feature representations. We first utilize a dynamic training confidence map instead of a fixed one, to better adapt to practical tracking scenes and avoid tracking failures. The dynamic training confidence map, acquired from a mapping neural network fed with the transferred deep abstract features, is designed to effectively capture the changing details of the tracked object. In addition to serving as a confidence map in the online training process, the dynamic training confidence map is also utilized in the updating process. Specifically, we use the training confidence index (TCI), derived from the map, to determine whether the mapping neural network needs back-propagation and whether the STC model should be updated in the current frame. The introduction of the dynamic training confidence map further alleviates the problem of location ambiguity [10]. Then, we take advantage of the original image grayscale and abundant abstract CNN features as hierarchical features to construct robust feature representations, which yields considerable performance improvements, as displayed in Fig. 1. For every feature map in our tracker, we train an independent STC model, which computes a predictive confidence map for locating the target in the next frame. Since the obtained predictive confidence maps predict the trajectory from various viewpoints, the final tracking result is determined by fusing these maps, which proves effective in practical tracking situations.
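The per-feature-map fusion and TCI-gated updating described above can be sketched as follows. The equal fusion weights and the peak-minus-mean form of the TCI are illustrative assumptions of ours, standing in for the paper's exact definitions.

```python
import numpy as np

def fuse_confidence_maps(maps, weights=None):
    """Weighted average of the per-feature-map predictive confidence maps;
    the peak of the fused map gives the final target location."""
    maps = np.stack(maps)
    if weights is None:
        weights = np.full(len(maps), 1.0 / len(maps))
    fused = np.tensordot(weights, maps, axes=1)
    return fused, np.unravel_index(np.argmax(fused), fused.shape)

def should_update(training_conf_map, tau=0.5):
    """TCI-style gate: fine-tune the network / update the STC model only when
    the training confidence map has a sharp, trustworthy peak."""
    tci = training_conf_map.max() - training_conf_map.mean()
    return tci > tau

rng = np.random.default_rng(0)
m1 = rng.random((50, 50)); m1[20, 30] += 2.0   # two models that agree on (20, 30)
m2 = rng.random((50, 50)); m2[20, 30] += 2.0
fused, loc = fuse_confidence_maps([m1, m2])
print(loc, should_update(m1))  # (20, 30) True
```

The gate illustrates why a dynamic map helps: under occlusion the training confidence map flattens, the TCI falls below the threshold, and the model skips updating instead of learning the occluder.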

The significant contributions of this paper can be summarized as follows:

  • A dynamic training confidence map, obtained from a mapping neural network, is exploited to cope with practical tracking challenges more effectively.

  • The hierarchical features from both the original image intensity and the transferred CNN features are simultaneously used in the process of constructing context prior models.

  • The training confidence index is introduced to guide the network fine-tuning and the model updating.

  • The extensive experimental results on the popular visual tracking benchmark OTB-2015 [40] and the newly proposed UAV-based benchmark UAV123 [39] indicate the remarkable tracking capability of our hierarchical spatio-temporal context (HSTC) tracker compared with other state-of-the-art trackers.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed algorithm in detail. Section 4 presents a series of comparative experiments between our tracker and several other state-of-the-art trackers on two tracking benchmarks. Finally, Section 5 concludes the paper.

Section snippets

Related work

Related state-of-the-art tracking algorithms are briefly reviewed in this section. In particular, we emphasize the classical STC tracker, which serves as our baseline algorithm.

Proposed algorithm

Based on the existing theory, some problems are worth attention. As we can see from [10], the baseline method utilizes a fixed training confidence map m(x), which may confuse the tracker, especially in situations such as full or partial occlusion. Moreover, the image grayscale feature alone cannot effectively capture the entire context appearance information when building the context prior model P(c(z)|o), which limits the tracking capability. Furthermore, the frame-by-frame updating
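As a sketch of the baseline STC formulation referenced here: the spatial context model h_sc is learned in the Fourier domain so that convolving it with the context prior P(c(z)|o) reproduces the training confidence map m(x), and the confidence map at test time is recovered by an inverse FFT. The numpy code below adds a small ridge term for numerical stability, which is our addition, not part of [10]; the patch sizes and Gaussian widths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
h = w = 64
cy = cx = 32                       # previous target centre x*
ys, xs = np.mgrid[0:h, 0:w]

def gaussian(sig):
    return np.exp(-(((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sig ** 2)))

# Context prior P(c(z)|o): image intensity weighted by a window focused on the
# previous target centre (random data stands in for a real grayscale patch).
intensity = rng.random((h, w))
prior = intensity * gaussian(10.0)

# Learn the spatial context model h_sc in the Fourier domain so that
# convolving it with the prior reproduces the training confidence map m(x).
m = gaussian(2.0)                  # training confidence map, peaked at x*
P = np.fft.fft2(prior)
H = np.fft.fft2(m) * P.conj() / (np.abs(P) ** 2 + 1e-2)  # ridge-stabilized

# Tracking step: c(x) = IFFT(H * FFT(prior)); its peak is the new location.
conf = np.real(np.fft.ifft2(H * np.fft.fft2(prior)))
print(np.unravel_index(np.argmax(conf), conf.shape))
```

Replacing the fixed m with a dynamically predicted map, and the grayscale intensity with hierarchical features, is exactly where the proposed method departs from this baseline.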

Experiments

In the following sections, we first present the implementation details of the experiments, and then conduct comprehensive experiments on a large number of video sequences from two tracking benchmarks [[39], [40]]. To demonstrate the remarkable performance of our HSTC tracker, we perform two kinds of comparisons: comparing our HSTC tracker with several state-of-the-art trackers, and comparing it with its ablation versions.
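Both benchmarks use the OTB-style protocol, scoring trackers by centre-location-error precision and bounding-box overlap (IoU) success. A minimal sketch of these two metrics, with made-up boxes:

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between predicted and ground-truth box centres,
    with boxes given as (x, y, w, h) rows."""
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Per-frame intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

pred = np.array([[10, 10, 40, 40], [12, 14, 40, 40]], dtype=float)
gt = np.array([[10, 10, 40, 40], [10, 10, 40, 40]], dtype=float)

# OTB-style scores: precision at the 20-pixel threshold and per-frame IoU
print(np.mean(center_error(pred, gt) <= 20))  # 1.0
print(iou(pred, gt))
```

The precision plot sweeps the pixel threshold and the success plot sweeps the IoU threshold; the reported numbers are areas under (or points on) those curves.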

Conclusions

In this paper, we propose an improved STC based tracker that makes full use of the capabilities, i.e., the feature representation and mapping, of abstract features from deep CNNs. Firstly, we exploit the transferred abstract features to acquire a dynamic training confidence map by training a mapping neural network online. Then, we combine hierarchical features of the original image intensity with all CNN features to enhance the tracking ability of the model. Thirdly, we resort to the TCI value reflected

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant No. 61501357 and Basic Science Research Project of Shaanxi Province under grant No. 2016JQ6080.

References (55)

  • C. Ma et al., Long-term correlation tracking
  • S. Hare et al., Struck: structured output tracking with kernels
  • H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: 2006 British Machine Vision...
  • K. Zhang et al., Fast visual tracking via dense spatio-temporal context learning
  • D. Ross et al., Incremental learning for robust visual tracking, Int. J. Comput. Vis. (2008)
  • L. Wang et al., STCT: Sequentially training convolutional networks for visual tracking
  • Y. Qi et al., Hedged deep tracking
  • C. Ma et al., Hierarchical convolutional features for visual tracking
  • N. Wang et al., Learning a deep compact image representation for visual tracking
  • J. Fan et al., Human tracking using convolutional neural networks, IEEE Trans. Neural Netw. (2010)
  • M. Danelljan et al., Adaptive color attributes for real-time visual tracking
  • L. Wen et al., Robust online learned spatio-temporal context model for visual tracking, IEEE Trans. Image Process. (2014)
  • G. Zhu et al., Weighted part context learning for visual tracking, IEEE Trans. Image Process. (2015)
  • M. Yang et al., Context-aware visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • T. Dinh et al., Context tracker: Exploring supporters and distracters in unconstrained environments
  • M. Wang, Y. Liu, Z. Huang, Large margin object tracking with circulant feature maps, arXiv preprint arXiv:1703.05020
  • S. Hong et al., Online tracking by learning discriminative saliency map with convolutional neural network (2015)