
1 Introduction

Visual tracking is one of the most important areas in computer vision, with applications in surveillance, navigation, human-computer interaction, robotics, etc. Visual object tracking can be described as follows: given an unknown target specified by a bounding box in the first frame of a sequence, estimate the states of the target in the following frames. After decades of development, tracking methods have progressed considerably, but several factors still limit tracking performance.

This paper is based on three observations about prior work. First, tracking methods suffer from large appearance changes caused by illumination variation, occlusion, abrupt motion, background clutter and deformation. We therefore apply an efficient strategy to update appearance templates, which prevents target models from being polluted by background information.

Second, convolutional neural networks (CNNs) have achieved significant success in object detection and image recognition. CNN features, learned from large-scale vision datasets, prove to be more robust and discriminative than hand-crafted features. We therefore use deep CNN features to cope with distractors.

In addition, tracking methods based on hand-crafted features benefit from efficient computation owing to their low feature dimension. With deep features of hundreds of channels, tracking speed drops considerably. To retain computational efficiency, we make use of correlation filters, which also alleviate the sampling ambiguity.

2 Related Work

Numerous visual tracking methods have been proposed over decades of study. Tracking-by-detection is a popular discriminative framework for object tracking: it takes context information into consideration and separates the target from the background by learning a classifier. Various machine learning algorithms have been applied within this framework, for instance multiple instance learning [1], boosting [3] and structured support vector machines (SVMs) [10].

Apart from classification, target representations play an important role in visual tracking. CNN features have raised image recognition precision to a level surpassing that of humans. DLT [12] pre-trains a small network on the Tiny Images dataset and then uses particle filters to localize targets. The idea of offline pre-training followed by fine-tuning is inherited by many later methods. Wang et al. propose FCNT [11] based on fully convolutional networks, where features selected from the conv4-3 and conv5-3 layers of VGGNet [6] are used separately to construct two nets designed to capture position information and to discriminate distractors. Ma et al. [7] apply hierarchical features in a coarse-to-fine manner to learn three filters; the final response map is the weighted sum of the three sub-maps, and its maximum determines the target location. With the use of hierarchical features, FCNT and HCFT perform well under background clutter.

Correlation tracking has recently attracted much attention due to its computational efficiency. It takes circularly shifted versions of features, with Gaussian-weighted labels centered at the target position, as training samples, which alleviates the problem of sampling ambiguity. Moreover, circulant matrices enable correlation operations to be computed efficiently in the frequency domain. Numerous extensions of correlation filters have been proposed to improve tracking accuracy, including CSK [4] and its kernelized extension KCF [5] with HOG features and a Gaussian kernel. The hand-crafted features adopted by these correlation filter based methods, such as color names or HOG, limit the robustness of the tracker and lead to drifting under severe deformation and occlusion.

3 Proposed Algorithm

This section consists of three parts: (1) correlation tracking, (2) online detection, and (3) model update. Each part is described below.

3.1 Correlation Tracking

As typical correlation trackers do, we learn a discriminative classifier and track targets by searching for the maximum value of the correlation response map. The appearance of the target is modeled by correlation filters \(w^{(l)}\). A feature vector \(\mathbf {x}\) of size \(M \times N \times D\) is extracted from a search window centered at the target position, where M, N and D denote the width, height and depth of the features, respectively. The training samples are all circular shifts of \(\mathbf {x}\) along the M and N dimensions, where each sample \(x_{m,n}^{(l)}\), \(m\!\in \! \left\{ 0,1,\dots ,M-1 \right\} \), \(n\!\in \!\left\{ 0,1,\dots ,N-1 \right\} \), has a Gaussian label \(y^{(l)} (m,n)=\exp \left( -\left( (m-M/2)^2+(n-N/2)^2\right) /2\sigma ^2\right) \), where \(\sigma \) is the kernel width. The correlation filter \(w^{(l)}\), with the same size as the feature \(\mathbf {x}\), is trained by solving the ridge regression

$$\begin{aligned} \min _{w^{(l)}} \sum _{m,n}\left\| \varPhi (x^{(l)}_{m,n})\cdot w^{(l)}-y^{(l)}(m,n) \right\| ^2+\lambda \left\| w^{(l)} \right\| ^2 \end{aligned}$$
(1)

where \(\varPhi \) denotes the mapping to a kernel space and \(\lambda \) is a non-negative regularization parameter. The learned filter \(w^{(l)}\) can be expressed as

$$\begin{aligned} w^{(l)}=\sum _{m,n}a^{(l)}(m,n)\varPhi (x^{(l)}_{m,n}) \end{aligned}$$
(2)

where the coefficient \(a^{(l)}\) is computed in the Fourier domain as

$$\begin{aligned} A^{(l)}=\mathcal {F}(a^{(l)})=\frac{\mathcal {F}(y^{(l)})}{\mathcal {F}(\varPhi (x^{(l)})\cdot \varPhi (x^{(l)}))+\lambda } \end{aligned}$$
(3)

In (3), \(\mathcal {F}\) denotes the fast Fourier transform operator and \(\mathcal {F}(y^{(l)})\) is the Fourier transform of the Gaussian label. The response map in a new frame is computed on an image patch z within an \(M \times N\) search window

$$\begin{aligned} \hat{y}^{(l)}=\mathcal {F}^{-1}(A^{(l)}\odot \mathcal {F}(\varPhi (z) \cdot \varPhi (\hat{x}^{(l)}))) \end{aligned}$$
(4)

where \(\hat{x}^{(l)}\) denotes the learned target appearance model and \(\odot \) is the Hadamard product. This yields a response map \(\hat{y}^{(l)}\) for each layer. We then sum the three maps with corresponding weights \(\gamma ^{(l)}\) to obtain \(y_{sum}=\sum _l\gamma ^{(l)}\hat{y}^{(l)}\), and the new target position is estimated by searching for the maximum value of \(y_{sum}\).
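To make the training and detection steps concrete, the following NumPy sketch implements Eqs. (1)-(4) with the Gaussian kernel correlation used in KCF (cf. the kernel selection in Sect. 4). It is a minimal illustration under those assumptions, not our exact implementation; names such as `gamma` (the layer weights \(\gamma ^{(l)}\)) and `lam` (\(\lambda \)) simply mirror the symbols above.

```python
import numpy as np

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel correlation between z and all cyclic shifts of x (KCF style)."""
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # inner products of z with every cyclic shift of x, summed over feature channels
    xz = np.real(np.fft.ifft2(np.sum(xf * np.conj(zf), axis=2), axes=(0, 1)))
    d = np.maximum((x ** 2).sum() + (z ** 2).sum() - 2.0 * xz, 0.0) / x.size
    return np.exp(-d / sigma ** 2)

def gaussian_labels(M, N, sigma_y):
    """Gaussian regression target y(m, n) peaked at the window centre."""
    m, n = np.meshgrid(np.arange(M) - M / 2.0, np.arange(N) - N / 2.0, indexing="ij")
    return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma_y ** 2))

def train_filter(x, y, sigma, lam):
    """Eq. (3): A = F(y) / (F(k_xx) + lambda), solved entirely in the Fourier domain."""
    k_xx = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

def layer_response(A, x_hat, z, sigma):
    """Eq. (4): correlation response of one feature layer on a new patch z."""
    k_zx = gaussian_correlation(z, x_hat, sigma)
    return np.real(np.fft.ifft2(A * np.fft.fft2(k_zx)))

def fuse_and_locate(responses, gamma):
    """Weighted sum y_sum over layers and the location of its maximum."""
    y_sum = sum(g * r for g, r in zip(gamma, responses))
    return np.unravel_index(np.argmax(y_sum), y_sum.shape), y_sum
```

In practice, correlation filter trackers usually multiply the features by a cosine window before the FFT to suppress boundary effects; that step is omitted here for brevity.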

3.2 Online Detection

A re-detection step is clearly essential for a long-term tracking algorithm in case of tracking failure. The tracking confidence is checked on every frame, but for computational efficiency and model robustness we activate the re-detection module only when \(\max (\hat{y}_{scale})<\mathcal {T}_{scale}\), where \(\hat{y}_{scale}\) is the scale response map and \(\mathcal {T}_{scale}\) is a re-detection threshold.

To obtain the scale response map, we construct a target pyramid around the estimated position \(\arg \max _{m,n}y_{sum}\). Let the target size in the current frame be \(W\times H\) and let K denote the number of scales \(s\in S\). For each scale in \(S=\left\{ a^k\mid k=-(K-1)/2,-(K-3)/2,\dots ,(K-1)/2 \right\} \), we crop an image patch of size \(sW \times sH\) centered at the predicted position. Unlike the motion model, the scale model is built on HOG features by solving the same ridge regression as (1). The scale best suited to the current target is

$$\begin{aligned} scale= \arg \max _j\left( \max (\hat{y}_{1}^{s}),\max (\hat{y}_{2}^{s}),\dots ,\max (\hat{y}_{K}^{s})\right)  \end{aligned}$$
(5)

where \(\hat{y}_{j}^{s}\) denotes the scale correlation response at level \(j\in \{1,2,\dots ,K\}\). All patches are resized to \(W \times H\) before correlation with the scale filter.
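As a rough illustration of the scale search, the sketch below (assuming OpenCV for cropping and resizing) builds the pyramid described above and applies Eq. (5); `scale_filter_response`, which would evaluate the HOG-based scale filter \(R_s\) on a patch, is a hypothetical callback supplied by the caller.

```python
import cv2
import numpy as np

def search_scale(frame, center, W, H, K, a, scale_filter_response):
    """Crop K patches of size (a^k W) x (a^k H) around `center`, resize them to
    W x H, evaluate the scale filter on each and return the best scale, Eq. (5).

    `scale_filter_response` is assumed to map a W x H patch to its HOG-based
    correlation response map (hypothetical helper, not defined here).
    """
    scales = a ** (np.arange(K) - (K - 1) / 2.0)
    peaks = []
    for s in scales:
        w, h = int(round(s * W)), int(round(s * H))
        patch = cv2.getRectSubPix(frame, (w, h), tuple(center))  # sW x sH crop
        patch = cv2.resize(patch, (W, H))                        # back to W x H
        peaks.append(scale_filter_response(patch).max())
    j = int(np.argmax(peaks))
    return scales[j], peaks[j]
```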

For the re-detection module, we train an online SVM classifier on HOG features to re-detect targets. Training samples are patches whose motion correlation response is higher than a confidence threshold \(\mathcal {T}_{rd}\). The label y of an SVM training sample is determined by its IoU (intersection over union) with the target bounding box,

$$\begin{aligned} y={\left\{ \begin{array}{ll} +1,&{}\text { if }IOU>0.9 \\ -1,&{}\text { if }IOU<0.5 \end{array}\right. } \end{aligned}$$
(6)

The SVM classifier takes the form \(f(x)=w\varPhi (x)+b\) and is learned from samples with labels \(y\in \{-1,+1\}\). The objective function is

$$\begin{aligned} \min _{w,b,\xi } \frac{1}{2}\left\| w \right\| ^2+C\sum _{i=1}^N\xi _i \end{aligned}$$
(7)

subject to the constraints

$$\begin{aligned} y_i(w\varPhi (x_i)+b)\ge 1-\xi _i,\quad \xi _i\ge 0 \end{aligned}$$
(8)

We thus learn the weight vector w by solving this convex quadratic optimization problem.
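A possible sketch of the sample labelling in Eq. (6) and the re-detector training is given below; an off-the-shelf RBF-kernel SVM (scikit-learn's `SVC`) stands in for the online SVM used in the paper, and the HOG extraction is left to the caller.

```python
import numpy as np
from sklearn.svm import SVC

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def label_samples(boxes, target_box):
    """Eq. (6): +1 if IoU > 0.9, -1 if IoU < 0.5; ambiguous samples are discarded."""
    keep, labels = [], []
    for i, b in enumerate(boxes):
        o = iou(b, target_box)
        if o > 0.9:
            keep.append(i); labels.append(+1)
        elif o < 0.5:
            keep.append(i); labels.append(-1)
    return keep, np.asarray(labels)

def train_redetector(hog_features, labels, C=1.0):
    """Fit the Gaussian-kernel SVM of Eqs. (7)-(8) on the labelled HOG descriptors."""
    return SVC(kernel="rbf", C=C).fit(hog_features, labels)
```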

3.3 Model Update

To ensure the robustness of our tracker, the appearance models must be updated as targets undergo occlusion, deformation and abrupt motion. The motion model \(R_m\) is updated with a learning rate \(\alpha \) on every frame:

$$\begin{aligned} \hat{x}_t^{(l)}=(1-\alpha )\hat{x}_{t-1}^{(l)}+\alpha x_t^{(l)} \end{aligned}$$
(9)
$$\begin{aligned} \hat{A}_t^{(l)}=(1-\alpha )\hat{A}_{t-1}^{(l)}+\alpha A_t^{(l)} \end{aligned}$$
(10)

where t is the index of the current frame and l indexes the feature layers.

The scale model \(R_s\), however, is updated only when the scale correlation response satisfies \(\max (\hat{y}_{scale})>\mathcal {T}_{update}\), with the same learning rate as \(R_m\).
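The update rules above amount to a simple exponential moving average; a minimal sketch follows, with parameter names mirroring \(\alpha \) and \(\mathcal {T}_{update}\).

```python
def update_motion_model(x_hat_prev, A_prev, x_new, A_new, alpha):
    """Eqs. (9)-(10): linear interpolation of template and filter for one layer."""
    x_hat = (1.0 - alpha) * x_hat_prev + alpha * x_new
    A_hat = (1.0 - alpha) * A_prev + alpha * A_new
    return x_hat, A_hat

def update_scale_model(model_prev, model_new, peak, T_update, alpha):
    """Update the scale model only when the scale response peak exceeds T_update."""
    if peak > T_update:
        return (1.0 - alpha) * model_prev + alpha * model_new
    return model_prev
```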

[Algorithm 1: main steps of the proposed tracking algorithm]

4 Implementation

The main steps of our algorithm are presented in Algorithm 1; further implementation details are described below.

Features. VGG-16 trained on ImageNet is adopted to extract hierarchical convolutional features in this work. The features used to train the motion model \(R_m\) are the outputs of the conv5-4, conv4-4 and conv3-4 convolutional layers, whereas the scale model \(R_s\) is built on HOG features. For the SVM detector, patches of all scales are resized to \(W \times H\).
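As an illustration of the feature extraction, the following PyTorch sketch pulls the three named activations from torchvision's pretrained VGG. Since layers named conv3-4, conv4-4 and conv5-4 exist in the 19-layer VGG configuration, the sketch assumes VGG-19 and its layer indices; these indices are an assumption and should be adjusted if another VGG variant is used.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Assumed indices of conv3-4, conv4-4 and conv5-4 inside torchvision's VGG-19
# `features` module.
LAYER_IDS = {"conv3_4": 16, "conv4_4": 25, "conv5_4": 34}

def extract_hierarchical_features(patch):
    """Return the three convolutional feature maps (H x W x D each) for R_m.

    `patch` is an RGB image (PIL Image or HWC uint8 array).
    """
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
    x = TF.normalize(TF.to_tensor(patch),
                     mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).unsqueeze(0)
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            for name, idx in LAYER_IDS.items():
                if i == idx:
                    feats[name] = x.squeeze(0).permute(1, 2, 0).numpy()
            if i == max(LAYER_IDS.values()):
                break
    return feats
```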

Kernel Selection. Both the motion model and the scale model adopt a Gaussian kernel \(k(x,x')=e^{-\left\| x-x' \right\| ^2/\sigma ^2 }\), which defines a mapping \(\varPhi \) such that \(k(x,x')=\varPhi (x)\cdot \varPhi (x')\).

SVM. The SVM classifier also uses a Gaussian kernel in order to separate non-linearly distributed samples. The re-detection step is implemented by scanning the whole frame with windows whose size corresponds to the optimal scale obtained by the correlation filter.
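A simple dense-scan sketch of this re-detection pass is given below; `score_fn`, which would compute HOG features of a patch and evaluate the SVM decision value \(f(x)\), is a hypothetical callback, and the stride is an illustrative parameter.

```python
import numpy as np

def scan_frame(frame, window, stride, score_fn):
    """Slide a W x H window over the frame and return the highest-scoring location."""
    W, H = window
    frame_h, frame_w = frame.shape[:2]
    best_pos, best_score = None, -np.inf
    for y in range(0, frame_h - H + 1, stride):
        for x in range(0, frame_w - W + 1, stride):
            score = score_fn(frame[y:y + H, x:x + W])
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score
```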

5 Experimental Results

Setups. We evaluate our algorithm, LHCF, on OTB2013 [13], which contains 50 sequences and the results of 29 trackers, and compare it with state-of-the-art methods using two metrics: distance precision and overlap success rate. The parameters used in our tests are listed in Table 1. Our algorithm is implemented in MATLAB on a 2.6 GHz Intel Core i5 CPU with 8 GB RAM.

Table 1. Parameters for implementation

Quantitative Evaluation. Our algorithm is evaluated on the benchmark against (1) correlation-based trackers: MEEM [15], KCF [5], CSK [4], DSST [9]; (2) CNN-based trackers: HCFT [7], DLT [12], FCNT [11]; and (3) classifier-based trackers: Struck [10], TLD [14], TGPR [2]. Figure 1 shows that our method performs favorably against the other methods in overlap success rate of OPE (one-pass evaluation), achieving an AUC score of 0.745 on OTB2013. Compared with HCFT and FCNT, which are designed for short-term tracking, our algorithm achieves a higher success rate, indicating that the re-detection step is of great importance to robustness. In terms of distance precision, LHCF (0.888) outperforms all methods based on hand-crafted features but is slightly inferior to HCFT (0.891). The precision achieved by our algorithm supports the view that features extracted from convolutional networks are more discriminative and robust than hand-crafted features in tracking, as they have proved to be in other areas such as image classification and detection.

Attribute-Based Evaluation. OTB2013 is annotated with 11 sequence attributes that describe different challenges in object tracking, including illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background clutter (BC) and low resolution (LR). Figure 2 shows distance precision plots of OPE for eight main attributes, including fast motion, motion blur, occlusion, out-of-view, background clutter, deformation and scale variation. As shown in Fig. 2, our algorithm performs well on most attributes. Sequences annotated with occlusion and out-of-view typically contain targets that temporarily disappear because of long occlusion or moving out of the frame; the proposed algorithm copes with this challenge through the SVM-based re-detection step. Moreover, LHCF performs better than DSST and LCT, both of which also contain a scale estimation module, which can be explained by the fact that deep-layer features carry more semantic information than HOG. However, on deformation LHCF is unexpectedly outperformed by HCFT, which has no re-detection module at all. Both algorithms update their models with a similar linear interpolation. When deformation occurs and the response drops below the threshold, the re-detection step is activated but contributes little in this situation, which indicates that the model update scheme has to be optimized in future work.

Fig. 1. Precision and success plots of OPE on OTB2013. The performance score for each tracker is shown in the legend; the precision score is measured at an error threshold of 20 pixels, while the success score is the AUC value.

Fig. 2. Precision plots of OPE for 8 challenging attributes on OTB2013. The number of videos for each attribute is shown in parentheses. Our algorithm, LHCF, performs best on almost all of the attributes.

Qualitative Evaluation. Figure 3 shows tracking results of LHCF compared with KCF [5], LCT [8], HCFT [7], DLT [12] and Struck [10]. Struck and DLT are seriously affected by distractors (Basketball). LHCF and HCFT track the rolling and fast-moving motorbike (MotorRolling) well thanks to the robust appearance models built on CNN features. However, on the Lemming sequence, in which the target undergoes long-term occlusion, only our algorithm tracks reliably without drifting to the background, which again shows the importance of the re-detection step. Notably, LCT, which also has a re-detection module, loses the target like the other short-term algorithms on this sequence, again revealing the advantage of deep convolutional features.

6 Conclusions

In this paper, we combine correlation filters and CNN features in a long-term visual tracking method. The target is represented with CNN features, and position and scale are estimated with correlation filters. An SVM is used as a re-detector to handle tracking failures. Experimental results demonstrate the effectiveness and robustness of our algorithm.

Fig. 3. Qualitative evaluation of the proposed algorithm, LCT, HCFT, KCF, Struck and DLT on four challenging sequences (from top to bottom: Basketball, Liquor, MotorRolling and Lemming).