1 Introduction

Visual tracking is a fundamental problem in computer vision and has been widely applied to video surveillance, robotics, medical imaging, and so on. Given the initial bounding box of the target, visual tracking estimates the location and scale of the target in the subsequent frames. Although visual tracking has been studied for several years, it remains an extremely challenging problem due to appearance changes, partial occlusion, motion blur, and background clutter. CNN-based trackers have drawn increasing attention and have achieved excellent results in visual tracking.

Most existing trackers adopt an online update strategy to capture appearance changes. FCNT [19] updates only the specific network, using the most confident tracking results, to avoid introducing background noise. CREST [15] collects all estimated locations and updates the model at a fixed frame interval. However, online updating can cause model drift because of tracking failures, occlusions, and inaccurate scale estimation.

One way to improve the robustness of online update is to reduce the amount of noisy and confusing data introduced into training. Self-paced learning (SPL), a recently proposed approach, is representative of such robust learning. SPL originates from curriculum learning (CL) [1], proposed by Bengio et al. In CL, a curriculum is defined as a set of training samples organized in ascending order of learning difficulty. However, the curriculum is fixed during the iterations and is not affected by subsequent learning. Inspired by the learning process of humans and animals, Kumar et al. proposed SPL, which generates a dynamic curriculum according to what the model has already learned. SPL helps avoid bad local minima and reach a more reasonable solution. Based on the above analysis, we propose a novel self-paced sample space model that distinguishes reliable data from noisy and confusing data to avert model drift.

Current trackers employ existing deep networks, pre-trained offline on large amounts of data, to extract features. In the traditional CNN structure, only the output of the immediately preceding layer is used as the input of the current layer, so other existing features are discarded. DenseNet [5] is a densely connected network that adds shortcut connections to enhance the information flow between layers and the reuse of features in the network. Inspired by DenseNet, we apply densely connected learning to reduce the dependency between adjacent layers in the CNN and improve the feature representations.

The contributions of this paper are threefold: (i) We propose a novel self-paced sample space model that integrates the SPL framework into visual tracking. It avoids drift during online update by choosing reliable data from noisy and confusing data. (ii) We apply densely connected learning to enhance the information flow and feature reuse of the network, which effectively improves the representation power of the features. (iii) We conduct extensive experiments on benchmark datasets. The results show that our tracker achieves state-of-the-art performance.

The rest of this paper is organized as follows. We first introduce existing visual tracking algorithms and the self-paced learning framework in Sect. 2. Then, our self-paced densely connected tracking (SPDCT) model and the tracking algorithm are discussed in Sect. 3 and Sect. 4. Experiments are detailed in Sect. 5 and concluding remarks are given in Sect. 6.

2 Related Work

CNN-Based Tracking. The capability of feature representations is very important for visual tracking. Deep neural networks, especially CNNs, are developing rapidly and have been successfully applied to visual tracking [11, 19, 20]. FCNT [19] employs a fully convolutional neural network and proposes a feature map selection method to improve tracking accuracy. HCFT [11] adopts hierarchical features to train correlation filters. STCT [20] casts online CNN training as ensemble learning to reduce over-fitting. Other CNN-based trackers argue that transferring pre-trained deep features may not be appropriate for online tracking. The methods mentioned above directly employ the traditional CNN structure to capture the appearance changes of the target. Different from existing CNN-based tracking methods, we propose a densely connected learning method to improve the robustness of the visual representation through feature reuse.

Self-Paced Learning. Self-paced learning [10, 12] learns the model iteratively from easy to complex samples, inspired by the learning process of humans and animals. Compared with other machine learning methods, SPL jointly learns the curriculum and the model parameters by incorporating a self-paced function and a pace parameter into the objective function. When the pace is small, only 'easy' samples with small losses are chosen as training data. As the value of the pace grows, more samples with larger losses are gradually added to train a more 'mature' model. [12] has proven that the learning process of the traditional SPL regime is guaranteed to converge to rational critical points of the corresponding implicit NCRP objective. SPL has been successfully applied to various applications, such as action and event detection [7], reranking [6], segmentation [8], and co-saliency detection [22]. [16] employs self-paced learning to solve long-term tracking problems. Like [16], we use self-paced learning to select reliable frames. Unlike [16], however, this paper also adopts a feature pyramid method to fuse multi-layer CNN features and densely connected learning to improve the robustness of the feature representation.

3 Proposed Visual Tracking Method

The main idea of the self-paced densely connected convolutional neural network is to integrate the SPL framework into the visual tracking algorithm. Specifically, SPDCT distinguishes reliable data from noisy data and uses only the reliable data to update the tracker, which keeps the model robust. Figure 1 shows the pipeline of our SPDCT model. The details are discussed as follows.

Fig. 1. The pipeline of the SPDCT tracking algorithm. We first extract hierarchical convolutional features of the target and fuse the multi-layer features. The fused features are then fed to the densely connected layers to obtain the response map. When updating the model, we adopt the self-paced sample space model to choose reliable samples.

3.1 Self-paced Sample Space Model

We propose a self-paced sample space model (SPSS) to avoid introducing background noise during online update. Formally, we denote the training dataset as \(D = \left\{ \left( x_{1}, y_{1} \right) , \left( x_{2}, y_{2} \right) ,...,\left( x_{n},y_{n} \right) \right\} \), where \(x_{i}\) and \(y_{i}\) denote the observed samples and the corresponding labels, respectively. The selection of reliable samples can be formulated as the following optimization problem:

$$\begin{aligned} \min \limits _{w,v}E\left( w,v \right) = \sum _{i=1}^{n}v_{i}L\left( y_{i} ,g\left( x_{i};w,b\right) \right) +f\left( \lambda ,v \right) \end{aligned}$$
(1)

where \(L\left( \cdot ,\cdot \right) \) denotes the quadratic loss between the label \(y_{i}\) and the estimated response value \(g\left( x_{i};w,b \right) \) with the weight vector w and bias parameter b. \(v = \left[ v_{1},v_{2},...,v_{n} \right] , v\in \left[ 0,1 \right] ^{n}\) denotes the importance weights of all training samples, where \(v_{i} = 1\) indicates a reliable sample, and \(\lambda \) is the pace parameter that controls the selection pace. The capability of the self-paced sample space model is determined by the self-paced function, which avoids the negative influence of outliers with large noise. The self-paced function takes the following form:

$$\begin{aligned} f\left( \lambda ,v \right) = -\lambda \left\| v \right\| _{1} = -\lambda \sum _{i=1}^{n}v_{i} \end{aligned}$$
(2)

Similar to SPL, the optimization problem of Eq. 1 can be solved by alternately optimizing the importance weights v and the model parameters \(\{w,b\}\). With v fixed, \(\{w,b\}\) can be optimized by existing off-the-shelf supervised learning methods, such as the back-propagation algorithm. With \(\{w,b\}\) fixed, \(v = \left[ v_{1},v_{2},...,v_{n}\right] \) can be easily calculated by

$$\begin{aligned} v_{i}^{*} = {\left\{ \begin{array}{ll} 1, & L\left( y_{i} ,g\left( x_{i};w,b \right) \right) < \lambda \\ 0, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
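As a brief justification of Eq. 3 (a sketch, assuming the self-paced function is the linear regularizer \(f\left( \lambda ,v \right) = -\lambda \sum _{i=1}^{n}v_{i}\) of Eq. 2): with \(\{w,b\}\) fixed, the objective of Eq. 1 is separable and linear in each \(v_{i}\),

$$\begin{aligned} E\left( w,v \right) = \sum _{i=1}^{n}v_{i}\left( L_{i} - \lambda \right) , \quad L_{i} = L\left( y_{i} ,g\left( x_{i};w,b\right) \right) \end{aligned}$$

so over \(v_{i}\in \left[ 0,1 \right] \) each term is minimized by \(v_{i} = 1\) when \(L_{i} < \lambda \) and by \(v_{i} = 0\) when \(L_{i} > \lambda \). At \(L_{i} = \lambda \) any value is optimal, and Eq. 3 chooses \(v_{i} = 0\).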
Fig. 2. Visualization of the training set during the online update phase. In each update iteration, our approach selects training samples with the target in the center of the image to better suppress background noise.

In traditional SPL methods, the pace parameter is increased by a fixed value in each iteration so that harder samples are chosen, and it is difficult to determine this fixed value effectively. In this paper, we propose an adaptive strategy based on the number of samples. In the \(t^{th}\) iteration, \(N_{t}\) denotes the total number of training samples and \(N_{p}\) denotes the proportion of samples to be selected. We first obtain \(L_{sort}\) by sorting the sample losses in ascending order, and the \(\left( N_{t}*N_{p} \right) ^{th}\) loss value of \(L_{sort}\) is used as the pace parameter in the SPL, as shown in Eq. 4.

$$\begin{aligned} \lambda = L_{sort}\left[ \left( N_{t}*N_{p} \right) \right] \end{aligned}$$
(4)
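The adaptive pace of Eq. 4 combined with the selection rule of Eq. 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and rounding the rank \(N_{t}*N_{p}\) up to an integer is an assumption, since the text does not state how a fractional rank is handled.

```python
import math


def adaptive_pace(losses, proportion):
    """Return the pace parameter lambda of Eq. 4 from per-sample losses."""
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    l_sort = sorted(losses)  # ascending order of loss
    # Assumption: the 1-based rank N_t * N_p is rounded up and clamped to [1, N_t].
    rank = min(len(l_sort), max(1, math.ceil(len(l_sort) * proportion)))
    return l_sort[rank - 1]


def select_reliable(losses, lam):
    """Eq. 3: v_i = 1 if the loss is strictly below lambda, else 0."""
    return [1 if loss < lam else 0 for loss in losses]


losses = [0.9, 0.1, 0.4, 0.2, 0.7, 0.3]
lam = adaptive_pace(losses, proportion=0.5)
weights = select_reliable(losses, lam)
print(lam, weights)
```

Because Eq. 3 uses a strict inequality, the sample whose loss equals \(\lambda \) is itself excluded, so slightly fewer than \(N_{t}*N_{p}\) samples are selected in this sketch.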

Fig. 2 compares the SPDCT algorithm (bottom row) with the CREST method (top row). In CREST, the training data is composed of consecutive video frames, so the model easily overfits the current video frames. For example, when occlusion occurs, CREST learns more background information, which causes tracking drift. In contrast, our model chooses reliable training samples through the SPSS model to avoid introducing background noise.

3.2 Densely Connected Learning

The quality of the features determines the performance of a tracker based on convolutional neural networks. Most CNN-based trackers directly employ the traditional CNN structure to capture the appearance changes of the target. In the traditional CNN structure, only the output of the previous layer is used as the input of the current layer, which discards existing features and hinders the information flow in the network. To enhance the reuse of features and reduce the dependence between adjacent layers, we apply a densely connected convolutional network instead of the traditional CNN, which connects each layer to every other layer in a feed-forward fashion. Figure 3 shows the structure of the densely connected learning. The \(l^{th}\) layer receives the feature maps generated by all previous layers as input. This form of densely connected learning can be formulated as follows:

$$\begin{aligned} x_{l} = H_{l}\left( \left[ x_{0},x_{1},...,x_{l-1} \right] \right) \end{aligned}$$
(5)

where \(H_{l}\left( \cdot \right) \) denotes the non-linear transformation composed of convolution (Conv) and rectified linear units (ReLU). Similar to [5], we concatenate the multiple inputs of \(H_{l}\left( \cdot \right) \). \(x_{l}\) is the output of the \(l^{th}\) layer. We adopt four layers in the densely connected learning with a small growth rate.
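The connectivity pattern of Eq. 5 can be illustrated with a toy sketch. Everything here is an assumption made for illustration: \(H_{l}\) is reduced to a 1x1 convolution followed by ReLU, each channel is a flat list of pixel values, the weights are random, and the growth rate of 2 is arbitrary. The actual layers of the tracker use spatial convolutions on real feature maps.

```python
import random


def relu(value):
    return max(0.0, value)


def make_layer(in_channels, growth_rate, rng):
    """Random weights for a toy H_l: one weight per (output, input) channel pair."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_channels)]
            for _ in range(growth_rate)]


def apply_layer(weights, channels):
    """Toy H_l: a 1x1 convolution over the concatenated channels, then ReLU."""
    num_pixels = len(channels[0])
    return [[relu(sum(w * ch[p] for w, ch in zip(row, channels)))
             for p in range(num_pixels)]
            for row in weights]


def dense_block(x0, num_layers, growth_rate, seed=0):
    """Apply x_l = H_l([x_0, ..., x_{l-1}]) for l = 1..num_layers."""
    rng = random.Random(seed)
    features = list(x0)      # running concatenation [x_0, ..., x_{l-1}]
    input_widths = []        # number of input channels seen by each layer
    for _ in range(num_layers):
        input_widths.append(len(features))
        weights = make_layer(len(features), growth_rate, rng)
        x_l = apply_layer(weights, features)
        features.extend(x_l)  # later layers also see this output
    return features, input_widths


x0 = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]]
features, input_widths = dense_block(x0, num_layers=4, growth_rate=2)
print(input_widths, len(features))
```

With 3 input channels and a growth rate of 2, the four layers receive 3, 5, 7 and 9 channels, which shows how each layer reuses all earlier features while adding only a few new ones.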

The densely connected layers enhance feature reuse and maximize the information flow through the neural network. According to the results in Sect. 5, the learned features are more robust to appearance changes.

Fig. 3. The structure of the densely connected learning.

3.3 Multi-layer Features Fusion

According to FCNT [19], convolutional layers at different levels capture different aspects of the target. A top layer encodes more semantic features in a low-resolution map, while a lower layer carries more spatial information in a high-resolution map. To maintain both the spatial and the semantic information of the features, we adopt the feature pyramid method described in feature pyramid networks (FPN) [9] to achieve multi-layer fusion, as shown in Fig. 4.
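A top-down fusion step in the style of FPN can be sketched as follows. This is a simplified illustration under stated assumptions, not the fusion used in the tracker: it handles a single channel stored as a nested list, assumes the coarse map has exactly half the resolution of the fine map, and omits the 1x1 lateral convolutions that FPN uses to match channel counts.

```python
def upsample_nearest_2x(feature_map):
    """Nearest-neighbor upsampling by a factor of 2 in both dimensions."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_top_down(fine_map, coarse_map):
    """Add the upsampled coarse (semantic) map to the fine (spatial) map."""
    up = upsample_nearest_2x(coarse_map)
    if len(up) != len(fine_map) or len(up[0]) != len(fine_map[0]):
        raise ValueError("coarse map must be half the size of the fine map")
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(fine_map, up)]


fine = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
coarse = [[10.0, 20.0],
          [30.0, 40.0]]
fused = fuse_top_down(fine, coarse)
print(fused[0])
```

The fused map keeps the resolution of the lower layer while each position also carries the response of the corresponding top-layer cell.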

Fig. 4. Multi-layer features fusion.

4 Tracking with SPDCT

We describe the detailed procedure of SPDCT, covering model initialization, detection, scale estimation, and online update, as listed in Algorithm 1.

Algorithm 1. The SPDCT tracking procedure.

Model Initialization. Similar to CREST [15], given the first frame with the target location, we extract a training patch centered on the target location and send the patch to an existing deep neural network to extract the features. The extracted features and the corresponding soft labels are used to train the weight and bias parameters of the densely connected layers. All the parameters in the densely connected layers are randomly initialized following a zero-mean Gaussian distribution.

Detection. When a new frame arrives, we crop a search patch centered on the tracking result of the previous frame. The patch has the same size as the training data. We obtain the response map through the densely connected layers and locate the target at the position of the maximum response value. The online tracking strategy is therefore simple and straightforward.
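The detection step can be sketched as a peak search on the response map. This is a minimal illustration with a hypothetical function name. It assumes that the response map is aligned with the search patch at one cell per pixel and that the patch is centered on the previous target position; a real tracker would also account for the stride of the feature extractor.

```python
def locate_target(response_map, prev_center):
    """Return the new (row, col) center and the peak response value."""
    best_value = float("-inf")
    best_pos = (0, 0)
    for r, row in enumerate(response_map):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    height, width = len(response_map), len(response_map[0])
    # Displacement of the peak from the patch center (assumed stride of 1).
    d_row = best_pos[0] - height // 2
    d_col = best_pos[1] - width // 2
    return (prev_center[0] + d_row, prev_center[1] + d_col), best_value


response = [
    [0.0, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.5, 0.9, 0.3],
    [0.0, 0.2, 0.4, 0.5, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
new_center, peak = locate_target(response, prev_center=(100, 200))
print(new_center, peak)
```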

Scale Estimation. Once we obtain the center location of the target, we crop the frame at different scales to get several patches. We send these patches to SPDCT to get the response values of the target. We estimate the scale of the target by searching for the maximum response value.
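The scale search can be sketched as picking the candidate scale whose patch yields the largest peak response. The scale factors and response values below are made-up illustrative numbers, and the function name is hypothetical; the text does not state which scales are evaluated.

```python
def estimate_scale(peak_responses, current_size):
    """Pick the scale factor with the highest peak response.

    peak_responses maps a candidate scale factor to the maximum response
    obtained from the patch cropped at that scale.
    """
    best_scale = max(peak_responses, key=peak_responses.get)
    width, height = current_size
    return best_scale, (width * best_scale, height * best_scale)


# Hypothetical peak responses for three candidate scales.
responses = {0.95: 0.61, 1.0: 0.72, 1.05: 0.80}
best_scale, new_size = estimate_scale(responses, current_size=(40.0, 80.0))
print(best_scale, new_size)
```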

Online Update. We adopt the self-paced sample space model to obtain reliable training data for the model update. We first collect tracking results as training samples. For each frame, the corresponding soft label is generated according to the predicted location. After obtaining the response map of the target, we calculate the sample weights by Eq. 3 and choose the samples with \(v_{i} = 1\) for the online update. To reduce over-fitting to recent samples and to satisfy the memory constraint, we select a maximum of N samples at a time and update the model online at a fixed frame interval.
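The update schedule can be sketched as follows, given sample weights already computed by Eq. 3. The helper names are hypothetical, and keeping the most recent reliable samples when more than N are available is an assumption; the text only states that at most N samples are used per update.

```python
def choose_update_samples(frame_indices, weights, max_samples):
    """Return the frames with v_i = 1, capped at max_samples."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    reliable = [idx for idx, v in zip(frame_indices, weights) if v == 1]
    # Assumption: when too many samples are reliable, keep the most recent ones.
    return reliable[-max_samples:]


def should_update(frame_index, interval):
    """Trigger a model update once every `interval` frames."""
    return frame_index > 0 and frame_index % interval == 0


chosen = choose_update_samples([1, 2, 3, 4, 5], [1, 0, 1, 1, 1], max_samples=3)
update_frames = [t for t in range(1, 13) if should_update(t, interval=5)]
print(chosen, update_frames)
```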

5 Experiments

In this section, we first explain the implementation details and then analyze the effects of the self-paced sample space model and densely connected learning. We validate the performance of our SPDCT tracker against state-of-the-art trackers on three benchmark datasets: OTB-50, OTB-100 [21] and UAV123 [13].

5.1 Experimental Setup

Implementation Details. Consistent with existing trackers, we use the VGG model as the feature extractor. We obtain the features from the outputs of the conv3-3 and conv4-3 layers of the VGG model. In the first frame, we obtain a training sample that is five times the size of the target bounding box. The soft label is a two-dimensional Gaussian function with a peak value of 1, and the learning rate is set to 5e−7. In the online update, we calculate \(\lambda \) by Eq. 4 and choose the reliable data with the adaptive percentage \(N_{p}\) of 0.5. N is set to 11. The SPDCT model is fine-tuned for 2 iterations with a learning rate of 1e−8 every 5 frames. SPDCT is implemented in MATLAB based on the MatConvNet wrapper [18].
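A soft label of the kind described above, a two-dimensional Gaussian with a peak value of 1 at the target center, can be sketched as follows. The function name and the bandwidth `sigma` are assumptions for illustration; the text does not state the bandwidth that is used.

```python
import math


def gaussian_soft_label(height, width, sigma):
    """2-D Gaussian label map with peak value 1 at the map center."""
    center_r = (height - 1) / 2.0
    center_c = (width - 1) / 2.0
    return [[math.exp(-((r - center_r) ** 2 + (c - center_c) ** 2)
                      / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]


# sigma = 1.0 is an arbitrary illustrative bandwidth.
label = gaussian_soft_label(5, 5, sigma=1.0)
print(label[2][2], label[2][1])
```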

Benchmark Datasets. We conduct our experiments on three benchmark datasets: OTB-50, OTB-100 [21] and UAV123 [13]. The OTB-50 and OTB-100 datasets contain 50 and 100 real-world tracking targets, respectively. The sequences are annotated with 11 attributes, such as occlusion, scale variation, motion blur, and background clutter. The UAV123 dataset consists of 123 aerial videos with more than 110K frames.

Evaluation Methodology. We use the one-pass evaluation (OPE) with precision and success plots to compare against current state-of-the-art trackers. The precision plot shows the percentage of frames in which the distance between the estimated location and the ground truth is within 20 pixels. The success plot shows the percentage of frames in which the overlap ratio between the estimated box and the ground truth box exceeds a given threshold. All the trackers are ranked according to the area under curve (AUC) of each success plot.
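The two metrics can be sketched as follows. This is an illustrative approximation rather than the benchmark toolkit: boxes are assumed to be `(x, y, w, h)` tuples, the overlap ratio is intersection over union, and the AUC is approximated by the mean success rate over 21 evenly spaced thresholds in [0, 1].

```python
import math


def center_distance(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap_ratio(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(1 for p, g in zip(preds, gts)
               if center_distance(p, g) <= threshold)
    return hits / len(preds)


def success_auc(preds, gts, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [overlap_ratio(p, g) for p, g in zip(preds, gts)]
    rates = []
    for i in range(steps):
        t = i / (steps - 1)
        rates.append(sum(1 for o in overlaps if o > t) / len(overlaps))
    return sum(rates) / len(rates)


preds = [(0, 0, 10, 10), (30, 0, 10, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(precision(preds, gts), success_auc(preds, gts))
```

In this toy example the first frame is tracked perfectly and the second misses the target entirely, so the precision at 20 pixels is 0.5.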

Fig. 5. Ablative experiments on the OTB-50 benchmark.

5.2 Ablation Studies

The SPDCT algorithm consists of the self-paced sample space model and densely connected learning. We conduct ablation studies on the OTB-50 dataset to analyze the effect of each part. We set up four contrast experiments: the standard SPDCT tracker, the SPDCT tracker without the self-paced sample space model (SPDCT-spss free), the SPDCT tracker without the densely connected model (SPDCT-densely learning free), and SPDCT with neither the self-paced sample space model nor densely connected learning (SPDCT-neither).

Figure 5 shows the precision and success plots of the above ablative experiments. The experimental results show that both the self-paced sample space model and densely connected learning help to improve the performance. The self-paced sample space model enhances the ability of the tracker to discern the target because it selects reliable samples for updating the model and thus avoids introducing noise. The densely connected learning model enriches the input of the convolutional layers by reusing the convolutional features, alleviates over-fitting, and enhances the representation power of the features. In the precision plots, SPDCT-spss free performs worse than SPDCT-neither. This is because densely connected learning has a greater learning capacity: when there is noise in the training samples, the model learns features unrelated to the target and loses its representation power. The standard SPDCT has the best results.

Fig. 6. Precision and success plots on the OTB-50 dataset.

Fig. 7. Precision and success plots on the OTB-100 dataset.

Fig. 8. Precision and success plots on the UAV123 dataset.

Fig. 9. Qualitative evaluation of our SPDCT tracker, DeepSRDCF, HCFT, and CREST on three challenging sequences (from top to bottom: ClifBar, Human3, Ironman).

5.3 Comparisons to State-of-the-art Trackers

In this section, we compare the SPDCT model with recent state-of-the-art trackers, including HDT [14], CREST [15], SRDCFdecon [4], DeepSRDCF [2], HCFT [11], SINT [17], FCNT [19], MEEM [23], SRDCF [3], and the other 29 trackers from the OTB-2015 benchmark [21] and UAV123 [13]. We initialize the model randomly and then use the first frame of the video as the training sample. The ten best results are shown in Figs. 6, 7 and 8. The experimental results on OTB-50, OTB-100 and UAV123 show that our SPDCT tracking algorithm is among the best-performing trackers. On OTB-50, the precision score is 3.5% and 3.6% higher than those of HDT and HCFT, respectively, and the success score is 0.7% and 0.9% higher than those of CREST and BACF, respectively. On OTB-100, the precision score is 1.7% and 2.0% higher than those of DeepSRDCF and HDT, respectively, and the success score ranks fourth. On UAV123, the precision score is 2.1% higher than that of SRDCF, and the success score ranks second. Our SPDCT model does not use any auxiliary training data. We account for the reliability of the samples through the self-paced sample space model and for feature reuse through densely connected learning, which improves the robustness of the model. The results reach state-of-the-art performance and show that our SPDCT model has good generalization ability. Figure 9 visualizes qualitative evaluation results. We compare three top-performing trackers, DeepSRDCF, HCFT and CREST, with our SPDCT tracker on three challenging sequences. The results show that our SPDCT model compares favorably with these state-of-the-art trackers.

6 Conclusion

In this paper, we have proposed a novel self-paced sample space model that integrates the SPL framework into visual tracking to distinguish reliable data from noisy and confusing data and thereby avoid the model drift problem. We also apply densely connected learning to improve the information flow and feature reuse of the network, which effectively enhances the representation power of the features. Experiments on three benchmark datasets demonstrate that our SPDCT model achieves state-of-the-art performance. In the future, we will consider how to effectively construct diverse samples for visual tracking in the self-paced learning framework.