
1 Introduction

Visual tracking is one of the most important areas in computer vision, with applications in surveillance, navigation, human-computer interaction, robotics, etc. Visual object tracking can be described as follows: given an unknown target specified by a bounding box in the first frame of a sequence, estimate the states of the target in the following frames. After decades of development, tracking methods have progressed considerably, but several factors still limit tracking performance.

This paper is based on three observations about prior work. First, tracking methods suffer from large appearance changes caused by illumination variation, occlusion, abrupt motion, background clutter and deformation. We therefore apply an efficient strategy to update appearance templates, which prevents target models from being polluted by background information.

Second, convolutional neural networks (CNNs) have achieved significant success in object detection and image recognition. CNN features, learned from large-scale vision datasets, prove to be more robust and discriminative than hand-crafted features. We therefore use deep CNN features to cope with distractors.

In addition, tracking methods based on hand-crafted features benefit from efficient computation owing to their low feature dimension. With deep features of hundreds of channels, tracking speed drops considerably. To retain computational efficiency, we make use of correlation filters, which also alleviate the sampling ambiguity.

2 Related Work

Numerous visual tracking methods have been proposed over decades of study. Tracking-by-detection is a popular discriminative framework for object tracking: it takes context information into consideration and separates the target from the background by learning a classifier. Various machine learning algorithms have been applied within this framework, for instance multiple instance learning [1], boosting [3] and structured support vector machines (SVMs) [10].

Apart from classification, target representations play an important role in visual tracking. CNN features have raised image recognition precision to a level surpassing that of humans. DLT [12] pre-trains a small network on the Tiny Images dataset and then uses particle filters to localize targets. The idea of offline pre-training followed by fine-tuning is inherited by many later methods. Wang et al. propose FCNT [11] based on fully convolutional networks, where features selected from the conv4-3 and conv5-3 layers of VGGNet [6] are used separately to construct two nets designed to capture position information and to discriminate distractors. Ma et al. [7] apply hierarchical features in a coarse-to-fine manner to learn three filters; the final response map is the weighted sum of the three sub-maps, and its maximum determines the target location. With the use of hierarchical features, FCNT and HCFT perform well under background clutter.

Correlation tracking has recently attracted much attention due to its computational efficiency. It takes circularly shifted versions of features, with Gaussian-weighted labels centered at the target position, as training samples, which alleviates the problem of sampling ambiguity. Moreover, circulant matrices enable correlation operations to be computed efficiently in the frequency domain. Numerous extensions of correlation filters have been proposed to improve tracking accuracy, including CSK [4] and its kernelized extension KCF [5] with HOG features and a Gaussian kernel. The hand-crafted features adopted by these correlation filter based methods, such as color names or HOG, limit the robustness of the tracker and lead to drifting under severe deformation and occlusion.

3 Proposed Algorithm

This section consists of three parts: (1) correlation tracking, (2) online detection, and (3) model update. Each part is described below.

3.1 Correlation Tracking

As typical correlation trackers do, we learn a discriminative classifier and track targets by searching for the maximum value of the correlation response map. The appearance of the target is modeled by correlation filters \(w^{(l)}\). A feature vector \(\mathbf {x}\) of size \(M \times N \times D\) is extracted from a search window centered at the target position, where M, N and D denote the width, height and depth of the features, respectively. The training samples are all circular shifts of \(\mathbf {x}\) along the M and N dimensions, where each sample \(x_{m,n}^{(l)}\), \(m\!\in \! \left\{ 0,1,\dots ,M-1 \right\} \), \(n\!\in \!\left\{ 0,1,\dots ,N-1 \right\} \), has a Gaussian label \(y^{(l)} (m,n)=\exp \left( -\left( (m-M/2)^2+(n-N/2)^2\right) /2\sigma ^2\right) \), where \(\sigma \) is the kernel width. The correlation filter \(w^{(l)}\), with the same size as the feature \(\mathbf {x}\), is trained by solving the ridge regression

$$\begin{aligned} \min _{w^{(l)}} \sum _{m,n}\left\| \varPhi (x^{(l)}_{m,n})\cdot w^{(l)}-y^{(l)}(m,n) \right\| ^2+\lambda \left\| w^{(l)} \right\| ^2 \end{aligned}$$
(1)

where \(\varPhi \) denotes the mapping to a kernel space and \(\lambda \) is a non-negative regularization parameter. The learned filter \(w^{(l)}\) can be expressed as

$$\begin{aligned} w^{(l)}=\sum _{m,n}a^{(l)}(m,n)\varPhi (x^{(l)}_{m,n}) \end{aligned}$$
(2)

where the coefficient \(a^{(l)}\) is computed in the Fourier domain as

$$\begin{aligned} A^{(l)}=\mathcal {F}(a^{(l)})=\frac{\mathcal {F}(y^{(l)})}{\mathcal {F}(\varPhi (x^{(l)})\cdot \varPhi (x^{(l)}))+\lambda } \end{aligned}$$
(3)

In (3), \(\mathcal {F}\) denotes the fast Fourier transform operator and \(\mathcal {F}(y^{(l)})\) is the Fourier transform of the Gaussian label. The response map in a new frame is computed on an image patch z within an \(M \times N\) search window

$$\begin{aligned} \hat{y}^{(l)}=\mathcal {F}^{-1}(A^{(l)}\odot \mathcal {F}(\varPhi (z) \cdot \varPhi (\hat{x}^{(l)}))) \end{aligned}$$
(4)

where \(\hat{x}^{(l)}\) denotes the learned target appearance model and \(\odot \) is the Hadamard product. This yields a response map \(\hat{y}^{(l)}\) for each layer. We then sum the three maps with corresponding weights \(\gamma ^{(l)}\) to obtain \(y_{sum}=\sum _l\gamma ^{(l)}\hat{y}^{(l)}\), and the new target position is estimated by searching for the maximum value of \(y_{sum}\).
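To make the training and detection steps concrete, the following NumPy sketch implements Eqs. (1)-(4) with the Gaussian kernel correlation used in KCF (cf. the kernel selection in Sect. 4). It is a minimal illustration under those assumptions, not our exact implementation; names such as `gamma` (the layer weights \(\gamma ^{(l)}\)) and `lam` (\(\lambda \)) simply mirror the symbols above.

```python
import numpy as np

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel correlation between z and all cyclic shifts of x (KCF style)."""
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # inner products of z with every cyclic shift of x, summed over feature channels
    xz = np.real(np.fft.ifft2(np.sum(xf * np.conj(zf), axis=2), axes=(0, 1)))
    d = np.maximum((x ** 2).sum() + (z ** 2).sum() - 2.0 * xz, 0.0) / x.size
    return np.exp(-d / sigma ** 2)

def gaussian_labels(M, N, sigma_y):
    """Gaussian regression target y(m, n) peaked at the window centre."""
    m, n = np.meshgrid(np.arange(M) - M / 2.0, np.arange(N) - N / 2.0, indexing="ij")
    return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma_y ** 2))

def train_filter(x, y, sigma, lam):
    """Eq. (3): A = F(y) / (F(k_xx) + lambda), solved entirely in the Fourier domain."""
    k_xx = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

def layer_response(A, x_hat, z, sigma):
    """Eq. (4): correlation response of one feature layer on a new patch z."""
    k_zx = gaussian_correlation(z, x_hat, sigma)
    return np.real(np.fft.ifft2(A * np.fft.fft2(k_zx)))

def fuse_and_locate(responses, gamma):
    """Weighted sum y_sum over layers and the location of its maximum."""
    y_sum = sum(g * r for g, r in zip(gamma, responses))
    return np.unravel_index(np.argmax(y_sum), y_sum.shape), y_sum
```

In practice, correlation filter trackers usually multiply the features by a cosine window before the FFT to suppress boundary effects; that step is omitted here for brevity.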

3.2 Online Detection

A re-detection step is clearly essential for a long-term tracking algorithm in case of tracking failure. The tracking confidence is checked on every frame, but for computational efficiency and model robustness we activate the re-detection module only when \(\max (\hat{y}_{scale})<\mathcal {T}_{scale}\), where \(\hat{y}_{scale}\) is the scale response map and \(\mathcal {T}_{scale}\) is a re-detection threshold.

To obtain the scale response map, we construct a target pyramid around the estimated position \(\arg \max _{m,n}y_{sum}\). Let the target size in the current frame be \(W\times H\) and let K denote the number of scales \(s\in S\). For each scale in \(S=\left\{ a^k\mid k=-(K-1)/2,-(K-3)/2,\dots ,(K-1)/2 \right\} \), we crop an image patch of size \(sW \times sH\) centered at the predicted position. Unlike the motion model, the scale model is built on HOG features by solving the same ridge regression as (1). The scale best suited to the current target is

$$\begin{aligned} scale= \arg \max _j\left( \max (\hat{y}_{1}^{s}),\max (\hat{y}_{2}^{s}),\dots ,\max (\hat{y}_{K}^{s})\right)  \end{aligned}$$
(5)

where \(\hat{y}_{j}^{s}\) denotes the scale correlation response at level \(j\in \{1,2,\dots ,K\}\). All patches are resized to \(W \times H\) before correlation with the scale filter.
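As a rough illustration of the scale search, the sketch below (assuming OpenCV for cropping and resizing) builds the pyramid described above and applies Eq. (5); `scale_filter_response`, which would evaluate the HOG-based scale filter \(R_s\) on a patch, is a hypothetical callback supplied by the caller.

```python
import cv2
import numpy as np

def search_scale(frame, center, W, H, K, a, scale_filter_response):
    """Crop K patches of size (a^k W) x (a^k H) around `center`, resize them to
    W x H, evaluate the scale filter on each and return the best scale, Eq. (5).

    `scale_filter_response` is assumed to map a W x H patch to its HOG-based
    correlation response map (hypothetical helper, not defined here).
    """
    scales = a ** (np.arange(K) - (K - 1) / 2.0)
    peaks = []
    for s in scales:
        w, h = int(round(s * W)), int(round(s * H))
        patch = cv2.getRectSubPix(frame, (w, h), tuple(center))  # sW x sH crop
        patch = cv2.resize(patch, (W, H))                        # back to W x H
        peaks.append(scale_filter_response(patch).max())
    j = int(np.argmax(peaks))
    return scales[j], peaks[j]
```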

For the re-detection module, we train an online SVM classifier on HOG features to re-detect targets. Training samples are patches whose motion correlation response is higher than a confidence threshold \(\mathcal {T}_{rd}\). The label y of an SVM training sample is determined by its IoU (intersection over union) with the target bounding box,

$$\begin{aligned} y={\left\{ \begin{array}{ll} +1,&{}\text { if }IOU>0.9 \\ -1,&{}\text { if }IOU<0.5 \end{array}\right. } \end{aligned}$$
(6)

The SVM classifier takes the form \(f(x)=w\varPhi (x)+b\) and is learned from samples with labels \(y\in \{-1,+1\}\). The objective function is

$$\begin{aligned} \min _{w,b,\xi } \frac{1}{2}\left\| w \right\| ^2+C\sum _{i=1}^N\xi _i \end{aligned}$$
(7)

subject to the constraints

$$\begin{aligned} y_i(w\varPhi (x_i)+b)\ge 1-\xi _i,\quad \xi _i\ge 0 \end{aligned}$$
(8)

We thus learn the weight vector w by solving this convex quadratic optimization problem.
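A possible sketch of the sample labelling in Eq. (6) and the re-detector training is given below; an off-the-shelf RBF-kernel SVM (scikit-learn's `SVC`) stands in for the online SVM used in the paper, and the HOG extraction is left to the caller.

```python
import numpy as np
from sklearn.svm import SVC

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def label_samples(boxes, target_box):
    """Eq. (6): +1 if IoU > 0.9, -1 if IoU < 0.5; ambiguous samples are discarded."""
    keep, labels = [], []
    for i, b in enumerate(boxes):
        o = iou(b, target_box)
        if o > 0.9:
            keep.append(i); labels.append(+1)
        elif o < 0.5:
            keep.append(i); labels.append(-1)
    return keep, np.asarray(labels)

def train_redetector(hog_features, labels, C=1.0):
    """Fit the Gaussian-kernel SVM of Eqs. (7)-(8) on the labelled HOG descriptors."""
    return SVC(kernel="rbf", C=C).fit(hog_features, labels)
```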

3.3 Model Update

To ensure the robustness of our tracker, the appearance models must be updated as targets undergo occlusion, deformation and abrupt motion. The motion model \(R_m\) is updated with a learning rate \(\alpha \) on every frame:

$$\begin{aligned} \hat{x}_t^{(l)}=(1-\alpha )\hat{x}_{t-1}^{(l)}+\alpha x_t^{(l)} \end{aligned}$$
(9)
$$\begin{aligned} \hat{A}_t^{(l)}=(1-\alpha )\hat{A}_{t-1}^{(l)}+\alpha A_t^{(l)} \end{aligned}$$
(10)

where t is the index of the current frame and l indexes the feature layers.

The scale model \(R_s\), however, is updated only when the scale correlation response satisfies \(\max (\hat{y}_{scale})>\mathcal {T}_{update}\), with the same learning rate as \(R_m\).
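The update rules above amount to a simple exponential moving average; a minimal sketch follows, with parameter names mirroring \(\alpha \) and \(\mathcal {T}_{update}\).

```python
def update_motion_model(x_hat_prev, A_prev, x_new, A_new, alpha):
    """Eqs. (9)-(10): linear interpolation of template and filter for one layer."""
    x_hat = (1.0 - alpha) * x_hat_prev + alpha * x_new
    A_hat = (1.0 - alpha) * A_prev + alpha * A_new
    return x_hat, A_hat

def update_scale_model(model_prev, model_new, peak, T_update, alpha):
    """Update the scale model only when the scale response peak exceeds T_update."""
    if peak > T_update:
        return (1.0 - alpha) * model_prev + alpha * model_new
    return model_prev
```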

[Algorithm 1: main steps of the proposed tracking algorithm]

4 Implementation

The main steps of our algorithm are presented in Algorithm 1; further implementation details are described below.

Features. VGG-16 trained on ImageNet is adopted to extract hierarchical convolutional features in this work. The features used to train the motion model \(R_m\) are the outputs of the conv5-4, conv4-4 and conv3-4 convolutional layers, whereas the scale model \(R_s\) is built on HOG features. For the SVM detector, patches of all scales are resized to \(W \times H\).
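As an illustration of the feature extraction, the following PyTorch sketch pulls the three named activations from torchvision's pretrained VGG. Since layers named conv3-4, conv4-4 and conv5-4 exist in the 19-layer VGG configuration, the sketch assumes VGG-19 and its layer indices; these indices are an assumption and should be adjusted if another VGG variant is used.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Assumed indices of conv3-4, conv4-4 and conv5-4 inside torchvision's VGG-19
# `features` module.
LAYER_IDS = {"conv3_4": 16, "conv4_4": 25, "conv5_4": 34}

def extract_hierarchical_features(patch):
    """Return the three convolutional feature maps (H x W x D each) for R_m.

    `patch` is an RGB image (PIL Image or HWC uint8 array).
    """
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
    x = TF.normalize(TF.to_tensor(patch),
                     mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).unsqueeze(0)
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            for name, idx in LAYER_IDS.items():
                if i == idx:
                    feats[name] = x.squeeze(0).permute(1, 2, 0).numpy()
            if i == max(LAYER_IDS.values()):
                break
    return feats
```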

Kernel Selection. Both the motion model and the scale model adopt a Gaussian kernel \(k(x,x')=e^{-\left\| x-x' \right\| ^2/\sigma ^2 }\), which defines a mapping \(\varPhi \) such that \(k(x,x')=\varPhi (x)\cdot \varPhi (x')\).

SVM. The SVM classifier also uses a Gaussian kernel in order to separate non-linearly distributed samples. The re-detection step is implemented by scanning the whole frame with windows whose size corresponds to the optimal scale obtained by the correlation filter.
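A simple dense-scan sketch of this re-detection pass is given below; `score_fn`, which would compute HOG features of a patch and evaluate the SVM decision value \(f(x)\), is a hypothetical callback, and the stride is an illustrative parameter.

```python
import numpy as np

def scan_frame(frame, window, stride, score_fn):
    """Slide a W x H window over the frame and return the highest-scoring location."""
    W, H = window
    frame_h, frame_w = frame.shape[:2]
    best_pos, best_score = None, -np.inf
    for y in range(0, frame_h - H + 1, stride):
        for x in range(0, frame_w - W + 1, stride):
            score = score_fn(frame[y:y + H, x:x + W])
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score
```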

5 Experimental Results

Setups. We evaluate our algorithm, LHCF, on OTB2013 [13], which contains 50 sequences and the results of 29 trackers, and compare it with state-of-the-art methods using two metrics: distance precision and overlap success rate. The parameters used in our tests are listed in Table 1. Our algorithm is implemented in MATLAB on a 2.6 GHz Intel Core i5 CPU with 8 GB RAM.

Table 1. Parameters for implementation

Quantitative Evaluation. Our algorithm is evaluated on the benchmark against (1) correlation-based trackers: MEEM [15], KCF [5], CSK [4], DSST [9]; (2) CNN-based trackers: HCFT [7], DLT [12], FCNT [11]; and (3) classifier-based trackers: Struck [10], TLD [14], TGPR [2]. Figure 1 shows that our method performs favorably against the other methods in overlap success rate of OPE (one-pass evaluation), achieving an AUC score of 0.745 on OTB2013. Compared with HCFT and FCNT, which are designed for short-term tracking, our algorithm achieves a higher success rate, indicating that the re-detection step is of great importance to robustness. In terms of distance precision, LHCF (0.888) outperforms all methods based on hand-crafted features but is slightly inferior to HCFT (0.891). The precision achieved by our algorithm supports the view that features extracted from convolutional networks are more discriminative and robust than hand-crafted features in tracking, as they have proved to be in other areas such as image classification and detection.

Attribute-Based Evaluation. OTB2013 is annotated with 11 sequence attributes that describe different challenges in object tracking, including illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background clutter (BC) and low resolution (LR). Figure 2 shows distance precision plots of OPE for eight main attributes, including fast motion, motion blur, occlusion, out-of-view, background clutter, deformation and scale variation. As shown in Fig. 2, our algorithm performs well on most attributes. Sequences annotated with occlusion and out-of-view typically contain targets that temporarily disappear because of long occlusion or moving out of the frame; the proposed algorithm copes with this challenge through the SVM-based re-detection step. Moreover, LHCF performs better than DSST and LCT, both of which also contain a scale estimation module, which can be explained by the fact that deep-layer features carry more semantic information than HOG. However, on deformation LHCF is unexpectedly outperformed by HCFT, which has no re-detection module at all. Both algorithms update their models with a similar linear interpolation. When deformation occurs and the response drops below the threshold, the re-detection step is activated but contributes little in this situation, which indicates that the model update scheme has to be optimized in future work.

Fig. 1. Precision and success plots of OPE on OTB2013. The performance score for each tracker is shown in the legend; the precision score is measured at an error threshold of 20 pixels, while the success score is the AUC value.

Fig. 2. Precision plots of OPE for 8 challenging attributes on OTB2013. The number of videos for each attribute is shown in parentheses. Our algorithm, LHCF, performs best on almost all of the attributes.

Qualitative Evaluation. Figure 3 shows tracking results of LHCF compared with KCF [5], LCT [8], HCFT [7], DLT [12] and Struck [10]. Struck and DLT are seriously affected by distractors (Basketball). LHCF and HCFT track the rolling and fast-moving motorbike (MotorRolling) well thanks to the robust appearance models built on CNN features. However, on the Lemming sequence, in which the target undergoes long-term occlusion, only our algorithm tracks reliably without drifting to the background, which again shows the importance of the re-detection step. Notably, LCT, which also has a re-detection module, loses the target like the other short-term algorithms on this sequence, again revealing the advantage of deep convolutional features.

6 Conclusions

In this paper, we combine correlation filters and CNN features in a long-term visual tracking method. The target is represented with CNN features, and position and scale are estimated with correlation filters. An SVM is used as a re-detector to handle tracking failures. Experimental results demonstrate the effectiveness and robustness of our algorithm.

Fig. 3. Qualitative evaluation of the proposed algorithm, LCT, HCFT, KCF, Struck and DLT on four challenging sequences (from top to bottom: Basketball, Liquor, MotorRolling and Lemming).