
1 Introduction

Visual object tracking (VOT) aims to locate a target efficiently in a video sequence, and it remains a challenging problem in unconstrained applications due to deformation, abrupt motion, occlusion and illumination changes, even after several decades of intensive research [5, 10, 20, 36, 41, 42, 51]. Essentially, VOT needs to address three key issues: (1) how to represent a target, i.e., the observation model; (2) how to efficiently leverage the motion smoothness assumption to locate the target in the next frame; (3) how to update tracking models online, if necessary, to handle dynamic scenarios.

Fig. 1.

Illustration of tracking by classification (left column) vs. iterative shift (right column): tracking a fast-moving vehicle (first row) and a diving athlete with large deformation (second row). Given the initial box (green), classification-based methods sample many proposals, select the box (red) with the highest classification score, and collect positive (yellow and red) and negative (blue) samples to fine-tune the classifiers online. There may not be enough good samples for online learning in these hard scenarios. In contrast, the proposed iterative shift tracking adjusts the bounding box step by step to locate the target (e.g., 3 steps for the vehicle and 2 steps for the athlete), and makes formal decisions on when and how to update object models via reinforcement learning. The shift process is generally more efficient since fewer candidate regions are evaluated than in classification-based methods (Color figure online)

Appearance models have evolved from intensity templates [19], color histograms [14], and sparse features [4] to the now dominant deep features [47] extracted by CNN models. Thus, tracking may naturally be formulated as a classification or detection-and-association problem [35] using CNN classifiers. Yet even a strong observation model may not capture all possible variations of targets and needs to be updated on the fly during tracking. Nevertheless, online classifier learning is vulnerable to samples with ambiguous labels in hard scenarios, such as deformation, quick motion and occlusion, leading to model drift. The tracker needs to make decisions simultaneously on the target's motion status and on the tracking status, i.e., whether and how to update the observation model or even restart tracking if necessary. These are indeed tough decisions to make during online tracking.

To tackle the aforementioned issues (2) and (3), we introduce a deep reinforcement learning process to make decisions jointly on a target's motion status and a tracker's status in VOT. The motion status, i.e., the displacement and scaling of an object's bounding box, is estimated by a prediction network in an efficient iterative shift process. The tracker's status, referring to whether or how to update the observation model and whether to stop and restart tracking, is determined by an actor network. The proposed method, coined deep reinforcement learning with iterative shift (DRL-IS), exploits the correlation between the object's motion estimation and the current tracking status. The prediction and actor networks are learned offline from a large number of training video sequences, guided by a critic network, on how to take actions given the current frame and the previous target location and representation.

This method utilizes reinforcement learning as a principled way to learn how to make decisions during tracking; therefore, it is especially robust in hard cases such as deformation or abrupt motion, where either updating the model or stopping and restarting may be the sensible action. In contrast, existing methods such as ADNet [52], EAST [21] and POMDP [44] employ reinforcement learning to either estimate motion or make decisions on the tracking status, but not both jointly. Moreover, as shown in Fig. 1, the tracking result is estimated iteratively, instead of performing CNN classification on many candidate locations, leading to efficient computation.

The main contributions of our paper are two-fold: (1) we propose an actor-critic network to predict the object motion parameters and select actions on the tracking status, where the rewards for different actions are dedicatedly designed according to their impacts; (2) we formulate object tracking as an iterative shift problem, rather than CNN classification on possible bounding boxes, thus locating a target efficiently and precisely. The proposed DRL-IS is particularly capable of dealing with objects undergoing large deformations and abrupt motion, since the motion parameters are iteratively estimated and accumulated by the prediction network, and in such hard cases the tracker is self-aware enough to update the target feature and model or resort to detection to restart tracking. Our tracker achieves 0.909 distance precision and 0.671 overlap success on the OTB-2015 benchmark, and 0.812 distance precision and 0.590 overlap success on the Temple-Color128 benchmark, on a par with the best performance, while running about 5 times faster than competing state-of-the-art methods.

2 Related Work

Visual tracking has undergone extensive study over several decades on how to represent and locate a target in video sequences, and how to adapt the observation model online if necessary. Deep neural networks pre-trained for recognition tasks tend to also be effective in delineating an object's appearance in tracking, e.g., in the MDNet [35], FCNT [46], and CREST [42] trackers, and in [10, 18, 32, 47]. To find a target in the current frame, a motion model is assumed to sample candidate locations, as in the Kalman filter [1] or particle filter [22, 38]. Then, the observation model may be evaluated on hundreds of these locations, either as correlation filtering in MOSSE [5] and KCF [20], or as a discriminative classification [11] or regression problem [16], which is demanding in computation. Alternatively, an observation model may allow the candidate locations to be calculated or searched gradually and iteratively, as in optical flow [14] or mean-shift tracking [9], which is generally efficient since only a few locations are examined. This motivates us to propose the iterative shift process, where a prediction network adjusts target locations in an iterative manner and evaluates the neural network far fewer times.

The observation model may need to be updated during tracking to follow the changing appearance of a target, for instance, by collecting positive and negative samples [24] or bags [3] to conduct online learning [50]. A tracker has to make very tough decisions on when and how to update the observation model. In difficult scenarios, such as deformation, occlusion and abrupt motion, without any model update the tracker may lose the target; on the other hand, due to ambiguous or wrong labels, the tracker may drift to cluttered background after an online update. In these hard but not rare cases, a sensible decision might be to stop tracking and resort to object detection or other means to reinitialize, rather than drifting blindly and silently. This fundamental issue demands a formal decision-making procedure in tracking.

Deep reinforcement learning [2, 6, 7, 23, 26, 29, 33, 34, 40] is a principled paradigm to learn how to make decisions and select actions online, which has achieved great successes in Atari games [34], search of attention patches [7], and finding objects [29] and visual relations [40]. Recently, reinforcement learning has been adopted for tracking [21, 25, 44, 52, 53], e.g., an action-decision network [52] that generates actions to seek the location and size of a target object, or a decision policy tracker [44] that uses reinforcement learning to decide where to look in upcoming frames and when to re-initialize and update the appearance model of the tracked object. In this paper, we go further and learn how to jointly derive the target motion and make decisions on the tracker status, with a new and unified actor-critic network.

Fig. 2.

Overview of the DRL-IS tracking method. Given the initial bounding box of a target, we first extract a deep feature \(f\in \mathbb {R}^{1\times 512} \) from the fc4 layer. Then we concatenate the feature f of a candidate box and the current target feature \(f^{*}\in \mathbb {R}^{1\times 512}\). We generate the shift \(\delta \) with the prediction network \(\psi \) and select an action with the actor network \(\theta \). For the action continue, we adjust the bounding box of the target according to the output \(\delta \) of \(\psi \). For the action stop and update, we stop the iteration and update the appearance features of the target and the parameters of \(\psi \), while we skip the update for the action stop and ignore. When taking the action restart, the target may be lost, so we re-sample the initial bounding box. In the training stage, we use a deep critic network to estimate the Q-value of the current action with \(\delta \), and fine-tune the prediction network \(\psi \) and the actor network \(\theta \)

3 Approach

The proposed deep reinforcement learning with iterative shift (DRL-IS) approach involves three sub-networks: (1) the actor network, (2) the prediction network, and (3) the critic network, which share the convolutional layers and one fully connected layer (fc4), as shown in Fig. 2. We elaborate the formulation of DRL-IS for tracking and the learning procedure of these networks in the following subsections.
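
As a rough illustration of this layout, the sketch below (ours, not the authors' released code) shows a shared backbone producing the 512-d fc4 feature; each of the three heads then consumes the concatenation of the candidate feature f and the target feature \(f^{*}\). Layer sizes loosely follow the VGG-M-style settings in Sect. 4.2; everything else (strides, pooling) is assumed.

```python
# Hypothetical sketch of the shared layers in Fig. 2: three conv layers plus
# fc4, yielding the 512-d feature f used by the prediction/actor/critic heads.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(                      # VGG-M-style convs
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, 3, stride=1), nn.ReLU(),
        )
        self.fc4 = nn.Sequential(nn.Linear(512 * 3 * 3, 512), nn.ReLU())

    def forward(self, x):          # x: (B, 3, 107, 107) image crops
        return self.fc4(self.convs(x).flatten(1))        # (B, 512) feature
```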

3.1 Iterative Shift for Visual Tracking

We formulate visual object tracking as an iterative shift problem. Given the current frame and the previous tracking results, the prediction network \(\psi \) iteratively shifts the candidate bounding box to locate the target; meanwhile, the actor network \(\theta \) makes decisions on the tracking status: whether or not to update the target representation and the prediction network, or even restart tracking.

Formally, given a video \(V=\{I_1, I_2,\cdots , I_N\}\), where \(I_{t} \) is the t-th frame, the tracker is initialized by cropping the target with \(l_{1}=\{x_{1},y_{1},w_{1},h_{1} \}\) in the first frame, and its appearance is represented by the feature \(f_1\), i.e., the fc4 layer's output in the shared network. Given the tracking results \(l_{t-1}^*=\{x_{t-1},y_{t-1},w_{t-1},h_{t-1} \}\) and \(f_{t-1}^{*}\), we first extract \(f_t\) from \(I_{t}\) cropped by \(l_{t-1}^{*}\), and exploit the prediction network \(\psi \) to predict the movement \(\delta \) of the target between frames, which takes \(f_{t}\) and \(f_{t-1}^* \) as input:

$$\begin{aligned} \delta = \psi (f_{t},f_{t-1}^{*}). \end{aligned}$$
(1)

We denote the outputs of the prediction network as \(\delta = \{\varDelta _x,\varDelta _y,\varDelta _w,\varDelta _h\}\):

$$\begin{aligned}&\varDelta _x =(x_{t}-x_{t-1})/w_{t-1} , \nonumber \\&\varDelta _y =(y_{t}-y_{t-1})/h_{t-1} , \nonumber \\&\varDelta _w = \log (w_{t}/w_{t-1}), \nonumber \\&\varDelta _h = \log (h_{t}/h_{t-1}), \end{aligned}$$
(2)

where \(\varDelta _x\) and \(\varDelta _y\) specify a scale-invariant translation of the bounding box, and \(\varDelta _w\) and \(\varDelta _h\) specify log-space changes of the width and height of the bounding box relative to the previous frame [17]. It is hard to estimate the movement and shape change of the target accurately in one step when the object moves rapidly or deforms. Hence, the prediction network outputs the adjustments of the bounding box iteratively and accumulates them to obtain the tracking result. Thus, the neural network is evaluated for \(K_t\) iterations on \(I_t\), and the \(\delta _k\) of each step in Eq. (2) are accumulated. This iterative shift process is considerably faster than running a classification network on hundreds of bounding boxes.
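
For concreteness, a minimal helper of ours (the (x, y, w, h) box convention and the example deltas are assumptions) that inverts Eq. (2) to apply one predicted shift, and that can be called \(K_t\) times to accumulate the per-frame adjustments:

```python
import math

def apply_shift(box, delta):
    """box = (x, y, w, h); delta = (dx, dy, dw, dh) as defined in Eq. (2)."""
    x, y, w, h = box
    dx, dy, dw, dh = delta
    return (x + dx * w,          # scale-invariant translation of the box
            y + dy * h,
            w * math.exp(dw),    # log-space change of width
            h * math.exp(dh))    # log-space change of height

# Iterative shift: start from the previous result and refine step by step
# (the deltas below are made-up numbers, e.g. K_t = 2 steps).
box = (120.0, 80.0, 40.0, 60.0)
for delta in [(0.20, 0.05, 0.0, 0.0), (0.04, 0.01, 0.10, -0.05)]:
    box = apply_shift(box, delta)
```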

Meanwhile, the tracking status may affect the results as well, e.g., updating the prediction network on the fly if necessary. To make decisions jointly on a target’s motion status and a tracker’s status, we use the actor network \(\theta \) to generate the actions \(a_{1},a_{2},\cdots ,a_{k}, \cdots ,a_{K_t}\) according to a multinomial distribution:

$$\begin{aligned} p(a \vert s_{t,k})=\pi (s_{t,k}\vert \theta ), \sum _{i}p(a_{i} \vert s_{t,k}) =1, \end{aligned}$$
(3)

where \( a_{k} \in \mathcal {A}=\{continue, stop \ \& \ update, stop \ \& \ ignore, restart\}\), and the initial state \(s_{t,0}=\{I_{t},l_{t,0},f_{t-1}^{*}\}\) contains the image \(I_{t}\), initial location \(l_{t,0}=l_{t-1}^{*}\), and the appearance feature \(f_{t-1}^{*}\), and \(\pi (s_{t,k}\vert \theta )\) derives from the outputs of the actor network \(\theta \).
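
In code, the action selection of Eq. (3) amounts to drawing from a categorical distribution over the four actions; below is a small sketch of ours, assuming the actor head already returns normalized probabilities (the action names are ours as well).

```python
import torch

ACTIONS = ["continue", "stop_and_update", "stop_and_ignore", "restart"]

def select_action(action_probs, greedy=False):
    """action_probs: 1-D tensor of 4 probabilities from the actor (Eq. 3)."""
    if greedy:                                    # e.g. at test time
        idx = torch.argmax(action_probs).item()
    else:                                         # sample during training
        idx = torch.distributions.Categorical(probs=action_probs).sample().item()
    return ACTIONS[idx]
```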

For the action continue (continue shifting without updating the model) at step k, the shift \(\delta _k =\psi (f_{t,k},f_{t-1}^{*})\) is generated by the prediction network \(\psi \), where \(f_{t,k}\) is extracted from the crop at \(l_{t,k}\). The target position \(l_{t,k}\) is updated iteratively by applying \(\delta _k\) to \(l_{t,k-1}\).

For the action \( stop \ \& \ update\) (stop shifting and update the model), we stop the iterations, take \(l_{t}^*=l_{t,K_t} \) as the target location, and update the feature of the target and the parameters of the prediction network \(\psi \),

$$\begin{aligned} f_t^{*}= & {} \rho f_{t,K_t} + (1-\rho ) f_{t-1}^{*}, \end{aligned}$$
(4)
$$\begin{aligned} \psi _t= & {} \psi _{t-1} + \mu \mathbb {E}_{s,a}\frac{\partial Q(s,a,\delta \vert \phi )}{\partial \delta }\frac{\partial \delta }{\partial \psi }, \end{aligned}$$
(5)

where \(\rho \) is a weight coefficient; Eq. (4) is a common practice in tracking that lets the target feature evolve as a weighted sum of the current and previous representations. Equation (5) is an online learning rule to update the prediction network, where \(\mu \) is the learning rate and \( Q(s,a,\delta )\) is the output of the critic network \(\phi \), defined in Eq. (11). This action indicates reliable tracking, confident enough to update the target representation and the model.
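
A compressed sketch of what this action could look like in code (ours; `critic`, `psi`, `psi_optim`, `state` and the value of `rho` are placeholders, and the single gradient step stands in for the expectation in Eq. (5)):

```python
def stop_and_update(f_cur, f_star, psi, critic, psi_optim, state, rho=0.3):
    # Eq. (4): blend the current crop feature into the target representation.
    f_star_new = rho * f_cur + (1.0 - rho) * f_star

    # Eq. (5): one gradient step on psi that increases Q(s, a, delta | phi),
    # with the gradient flowing through delta into psi's parameters.
    delta = psi(f_cur, f_star)            # delta depends on psi's parameters
    q_value = critic(state, delta)        # scalar Q estimate from the critic
    psi_optim.zero_grad()
    (-q_value).backward()                 # maximize Q  <=>  minimize -Q
    psi_optim.step()
    return f_star_new
```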

For the action \( stop \ \& \ ignore\) (stop shifting without updating the object feature), we stop the iteration, take \(l_{t}^*=l_{t,k} \) as the target location, and move on to track the target in the next frame; the appearance feature \(f_{t}^{*}\) and the prediction network \(\psi \) are not updated. This action indicates that the target is found, yet the tracker is not confident enough to update the model, e.g., when motion blur or occlusion is present.

For the action restart (restart tracking), we restart the iteration by re-sampling a random set of candidate patches \(L_{t}\) around \(l^*_{t-1}\) in \(I_{t}\), and select the patch with the highest Q-value, defined in Eq. (12) according to the IoU objective, as the initial location:

$$ \begin{aligned} l_{t,0}=\arg \max _{s=\{I_{t},l,f_{t-1}^{*}\}, l\in L_{t}} Q(s,a={stop \ \& \ update},\delta =0 \vert \phi ). \end{aligned}$$
(6)

This action represents the cases where the tracker loses the target temporarily and resorts to an extensive search to re-initialize tracking.
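
A possible rendering of the restart step (our sketch; `sample_around`, `extract_feature` and the critic's calling convention are assumed helpers, not the paper's API):

```python
def restart(prev_box, frame, f_star, critic, extract_feature, sample_around,
            num_candidates=64):
    """Eq. (6): pick the candidate with the highest critic score as l_{t,0}."""
    best_box, best_q = None, float("-inf")
    for box in sample_around(prev_box, num_candidates):   # random boxes L_t
        f = extract_feature(frame, box)
        q = critic((f, f_star), action="stop_and_update", delta=0.0)
        if q > best_q:
            best_box, best_q = box, q
    return best_box
```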

Figure 3 presents a sample action sequence in tracking. The prediction and actor networks formulate the motion estimation and tracking status change in a unified way as taking actions in reinforcement learning. Nevertheless, learning these neural networks requires dedicatedly designed rewards for each type of action.

3.2 Training the Neural Networks in DRL-IS

In this subsection, we detail the training procedure of the prediction, actor, and critic networks by deep reinforcement learning, from a large number of labeled video sequences. Note that the prediction network is pre-trained offline, while during online tracking both the prediction and actor networks are jointly updated by the actor-critic approach.

Learning of the Prediction Network: The prediction network estimates the iterative shift of the object in a given frame, from the object location and features in consecutive frames. We pre-train a convolutional neural network in an end-to-end manner to predict the shift of the target object between frames or iteration steps.

Network Architecture: As illustrated in Fig. 2, the prediction network uses three convolutional layers to extract features from the target patch and the current candidate box during pre-training. Then the features are concatenated and fed into two fully connected layers to produce the parameters which estimate the location translation and scaling changes.

Network Inputs: We sample pairs of crops from the sequences between every two consecutive frames to feed the network. The first crop is taken at the object location in the previous frame, and the second crop is taken at the same location in the current frame. The crops are padded with a fixed ratio relative to the object scale, which is determined empirically in our experiments. The network receives a pair of crops warped to \(107\times 107\) pixels and estimates the motion \(\delta \) between the two adjacent frames.
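
As an illustration of this input preparation, a minimal crop-and-warp helper of ours (the padding ratio of 1.5 and the top-left-corner box convention are assumptions, since the paper only states that the ratio is set empirically):

```python
import cv2

def crop_and_warp(image, box, pad_ratio=1.5, out_size=107):
    """image: HxWx3 array; box: (x, y, w, h) with (x, y) the top-left corner."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0            # box center
    pw, ph = w * pad_ratio, h * pad_ratio        # padded crop size
    x0, y0 = int(round(cx - pw / 2)), int(round(cy - ph / 2))
    x1, y1 = int(round(cx + pw / 2)), int(round(cy + ph / 2))
    H, W = image.shape[:2]
    patch = image[max(y0, 0):min(y1, H), max(x0, 0):min(x1, W)]
    return cv2.resize(patch, (out_size, out_size))
```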

Network Pretraining: Instead of extracting features of region proposals and performing bounding-box regression, we train a fully end-to-end network to learn location translations and deformations directly. We perform data augmentation by sampling multiple examples with scale variations near the target bounding box and then create crops in the current frame. Using labeled video frames and these augmented samples, training the prediction network in this way helps locate a target within fewer iteration steps.

DRL-IS with Actor-Critic: We exploit the actor-critic algorithm [28] to jointly train the three sub-networks \(\theta ,\psi ,\phi \). First, we define the rewards according to the tracking performance. The reward of the action continue with \(\delta _{t,k} \) is defined by \(\varDelta _{IoU}\), rather than the IoU itself, so as to reward the adjustments of the bounding box:

$$\begin{aligned} r_{t,k}= \left\{ \begin{array}{cl} 1 &{}\qquad \varDelta _{IoU} \ge \epsilon \\ 0 &{}\qquad -\epsilon<\varDelta _{IoU} < \epsilon \\ -1 &{}\qquad \varDelta _{IoU} \le -\epsilon \end{array} \right. , \end{aligned}$$
(7)

where \(\epsilon > 0\) and \(\varDelta _{IoU}\) is computed as:

$$\begin{aligned} \varDelta _{IoU} = g(l_{t}^{*},l_{t,k})-g(l_{t}^{*},l_{t,k-1}), g(l_i,l_j)=\frac{l_i \cap l_j}{l_i \cup l_{j}}. \end{aligned}$$
(8)

For the actions \( stop \ \& \ update\) and \( stop \ \& \ ignore\), the rewards are defined by the IoU between the final prediction and the ground truth. To encourage stopping with fewer iterations, the positive reward depends on the number of iterations \(K_t\). We take \(l_{t}^{*} \) as the location of the object and compute the rewards as:

$$\begin{aligned} r_{t,K_t}=\left\{ \begin{array}{ll} 10/K_t &{}\qquad g(l_{t}^{*},l_{t,K_t}) \ge 0.7 \\ 0 &{}\qquad 0.4 \le g(l_{t}^{*},l_{t,K_t}) \le 0.7 \\ -5 &{}\qquad else \end{array} \right. . \end{aligned}$$
(9)

For the action restart, the reward is positive when the IoU between the final prediction and the ground truth is less than 0.4, considering the high computational cost of a restart.

$$\begin{aligned} r_{t,K_t}=\left\{ \begin{array}{ll} -1 &{}\qquad g(l_{t}^{*},l_{t,K_t}) \ge 0.7 \\ 0 &{}\qquad 0.4 \le g(l_{t}^{*},l_{t,K_t}) \le 0.7 \\ 1 &{}\qquad else \end{array} \right. . \end{aligned}$$
(10)
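
Putting Eqs. (7)-(10) together, a plain-Python sketch of ours of the three reward functions (following the text, the first argument is treated as the ground-truth box, and the threshold \(\epsilon \) is an assumed value since the paper does not report it):

```python
def iou(b1, b2):
    """g(.,.) of Eq. (8) for (x, y, w, h) boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def reward_continue(gt, box_k, box_km1, eps=0.05):   # Eq. (7); eps is assumed
    d = iou(gt, box_k) - iou(gt, box_km1)            # Delta IoU, Eq. (8)
    return 1 if d >= eps else (-1 if d <= -eps else 0)

def reward_stop(gt, box_final, num_steps):           # Eq. (9)
    o = iou(gt, box_final)
    return 10.0 / num_steps if o >= 0.7 else (0 if o >= 0.4 else -5)

def reward_restart(gt, box_final):                   # Eq. (10)
    o = iou(gt, box_final)
    return -1 if o >= 0.7 else (0 if o >= 0.4 else 1)
```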

Then we define the calculation of the Q-values of each action. The Q-value of the action continue differs from those of the other actions, since the reward of continue is based on the increment of IoU, while the others are based on the tracking performance evaluated by IoU. The Q-value of the action continue with \(\delta _{t,k }\) is computed as follows:

$$\begin{aligned} Q(s,a,\delta _{t,k})=\sum _{i=k}^{K_t}\gamma ^{(i - k)} r_{t,i}. \end{aligned}$$
(11)

The Q values of actions \( stop \ \& \ update\), \( stop \ \& \ ignore\), restart are computed as:

$$\begin{aligned} Q(s,a, \delta _{t,k}=0)=\sum _{j=t}^{N}\gamma ^{j-t} r_{j,k_{j}}. \end{aligned}$$
(12)

Equation (11) sums the rewards over the steps k within the current frame, while Eq. (12) sums the rewards over the time steps t. The reason for the different calculations of the Q-values in Eqs. (11) and (12) is that the action continue locates the target with the current models in frame t, while the other actions involve the decision of whether to stop tracking based on previous tracking performance.
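
The two discounted sums can be written directly as below (a small sketch of ours; the reward lists are assumed to be indexed from 0):

```python
def q_continue(step_rewards, k, gamma=0.95):
    """Eq. (11): sum_i gamma^(i-k) * r_{t,i} over the remaining steps i >= k."""
    return sum(gamma ** (i - k) * r for i, r in enumerate(step_rewards) if i >= k)

def q_stop(frame_rewards, t, gamma=0.95):
    """Eq. (12): sum_j gamma^(j-t) * r_{j,k_j} over the remaining frames j >= t."""
    return sum(gamma ** (j - t) * r for j, r in enumerate(frame_rewards) if j >= t)
```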

Finally, we formulate the optimization problem of \(\phi \) and \(\theta \) as follows:

$$\begin{aligned} \phi= & {} \arg \min _{\phi }L(\phi ) = \mathbb {E}_{s,a}( Q(s,a\vert \phi ) -r-\gamma Q(s',a'\vert \phi ^{-}) )^{2}, \end{aligned}$$
(13)
$$\begin{aligned} \theta= & {} \arg \min _{\theta }J(\theta ) = -\mathbb {E}_{s,a} \log (\pi (a,s \vert \theta )) \hat{A}(s,a). \end{aligned}$$
(14)

\(s'\) is the next state and \( a'=\arg \max _{a} Q(s',a\vert \phi ^{-})\). The advantage function \(\hat{A}(s,a)\) and the value function V(s) are calculated as follows:

$$\begin{aligned} \hat{A}(s,a)= & {} Q(s,a\vert \phi )-V(s), \end{aligned}$$
(15)
$$\begin{aligned} V(s)= & {} \mathbb {E}_{s}\pi (s,a\vert \theta ^{-})Q(s,a \vert \phi ^{-}), \end{aligned}$$
(16)

where \(\phi ^{-}\) is the target network, which has the same architecture as \(\phi \) but is updated only once every 10 iterations. Please refer to [37] for the details of reinforcement learning. We update the parameters of the critic network \(\phi \) and the actor network \(\theta \) as follows:

$$\begin{aligned} \phi= & {} \phi - \mu _{\phi } \frac{\partial L(\phi )}{\partial \phi }, \end{aligned}$$
(17)
$$\begin{aligned} \theta= & {} \theta - \mu _{\theta } \frac{\partial J(\theta )}{\partial \theta }. \end{aligned}$$
(18)

Algorithm 1 summarizes the learning procedure of the proposed method.
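
To make the update rules concrete, here is a condensed single-step sketch of the actor-critic training in Eqs. (13)-(18). It is our approximation, not Algorithm 1 verbatim: the critic here outputs one Q-value per action, the target actor \(\theta ^{-}\) of Eq. (16) is dropped, and the batch format is assumed.

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor, critic, critic_target, actor_opt, critic_opt,
                      batch, gamma=0.95):
    s, a, r, s_next = batch                       # states, action ids, rewards

    # Critic loss, Eq. (13): TD error against the target network phi^-.
    with torch.no_grad():
        td_target = r + gamma * critic_target(s_next).max(dim=1).values
    q = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss, Eq. (14): log-probability weighted by the advantage (15)-(16).
    probs = actor(s)                              # pi(a | s, theta), shape (B, 4)
    with torch.no_grad():
        q_all = critic(s)
        v = (probs * q_all).sum(dim=1)            # V(s), Eq. (16)
        adv = q_all.gather(1, a.unsqueeze(1)).squeeze(1) - v   # A_hat, Eq. (15)
    log_pi = torch.log(probs.gather(1, a.unsqueeze(1)).squeeze(1) + 1e-8)
    actor_loss = -(log_pi * adv).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```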

Fig. 3.

An illustrative example of the actions on tracking status change by the actor network: (1) at \(I_t\), the target is readily located by two continue actions, and a stop & update action updates the target feature \(f^{*}_t\) and the prediction network \(\psi \) accordingly; (2) at \(I_{t+1}\), a continue action first drifts to a nearby distractor person, then the tracker spots this and takes a restart action to re-initialize the tracking; (3) the shift process is restarted at \(I_{t+1}\); with a continue action, the target is found yet the scale is not reliable, and then a stop & ignore action returns the result but does not update the target feature \(f^{*}_{t}\)

Algorithm 1. The learning procedure of DRL-IS

4 Experiments

To validate the proposed approach, we conducted experiments on the popular Object Tracking Benchmark [48, 49], Temple-Color128 [31] and VOT-2016 [30], and compared with recent state-of-the-art trackers.

4.1 Datasets and Settings

We conducted experiments on the standard benchmarks OTB-2015, Temple-Color128 and VOT-2016. OTB-2015 [49] contains 100 video sequences, each fully annotated with ground-truth bounding boxes. Temple-Color128 contains 128 color sequences. The challenging attributes for visual object tracking on these two datasets include illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). We followed the standard evaluation metrics on these benchmarks. We used the one-pass evaluation (OPE) with the distance precision and overlap success plot metrics, where each tracker was initialized with the ground-truth location in the first frame and run until the end of each sequence. Specifically, the overlap success rate measures the overlap between predicted bounding boxes and ground-truth bounding boxes, and the distance precision metric is the percentage of frames where the estimated center location error from the ground truth is smaller than a given distance threshold. In our experiments, we set the threshold distance to 20 pixels for all trackers. The VOT-2016 dataset consists of 60 challenging videos selected from a set of more than 300 videos. Performance in terms of both accuracy (overlap with the ground truth) and robustness (failure rate) is evaluated in our experiments. Note that on the VOT-2016 dataset, a tracker is restarted from the ground truth in the case of a failure.
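
For reference, the two OPE metrics described above can be computed as follows (a sketch of ours with (x, y, w, h) boxes; NumPy arrays of shape (N, 4) are assumed):

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between box centers for (N, 4) arrays of (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def distance_precision(pred, gt, thresh=20.0):
    """Fraction of frames whose center error is within `thresh` pixels."""
    return float(np.mean(center_errors(pred, gt) <= thresh))

def overlap_success(pred, gt, thresh=0.5):
    """Fraction of frames whose IoU with the ground truth reaches `thresh`."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return float(np.mean(inter / union >= thresh))
```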

4.2 Implementation Details

We implemented our tracker in Python using the PyTorch library. The experiments were conducted on a PC with an Intel Core i7 3.4 GHz CPU and 24 GB RAM, and the deep neural networks were trained on a GeForce GTX 1080 Ti GPU with 11 GB VRAM. With these settings, the proposed tracker runs at about 10 frames per second on the two benchmarks [48, 49].

Prediction Network: The prediction network has three convolutional layers initialized from the VGG-M network [8] pretrained on ImageNet [15]. The next two fully connected layers have 512 and 100 output units with ReLU activations. The output fully connected layer has 4 output units with a tanh activation.
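
Read literally, the fully connected part corresponds to something along the following lines (our interpretation; the 1024-d concatenated input of f and \(f^{*}\) and the exact placement of the shared fc4 are assumptions):

```python
# Hypothetical fully connected part of the prediction network psi: the
# concatenated (f, f*) feature -> 512 -> 100 -> 4 shift parameters.
import torch.nn as nn

prediction_fc = nn.Sequential(
    nn.Linear(2 * 512, 512), nn.ReLU(),   # fc layer with 512 units
    nn.Linear(512, 100), nn.ReLU(),       # fc layer with 100 units
    nn.Linear(100, 4), nn.Tanh(),         # (dx, dy, dw, dh), Eq. (2)
)
```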

Actor-Critic Network: The actor network has two fully connected layers with 100 and 4 output units and a ReLU activation. The critic network is similar to the actor network, but its final layer has only one output unit. The current and candidate features are concatenated as the input to these two networks. We use the Adam optimizer [27] with a learning rate of 0.0001 and a discount factor \(\gamma \) (set to 0.95) to train the actor-critic network. We trained the actor-critic network on sequences randomly sampled from VOT-2013, VOT-2014, and VOT-2015 [30], from which videos overlapping with OTB and Temple-Color were excluded. The maximal number of actions is set to 10 for each frame, and the starting frame of each episode is randomly selected. Episode termination is determined by the mean IoU of the last 5 predicted bounding boxes with the ground-truth bounding boxes: if the mean IoU falls below 0.2 or the end of a sequence is reached, we terminate the episode and update the models. We trained the network for a total of 50,000 episodes until convergence. On the VOT-2016 dataset, we conducted experiments using ImageNet as the training set for our tracker. Since each object in this training set has only one frame (a static image), we set \(\gamma \) to 0 in Eq. (12) and removed the action stop & ignore.
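
The actor and critic heads with the quoted optimizer settings might look like this (our sketch; the 1024-d concatenated input and the softmax applied outside the module, per Eq. (3), are assumptions):

```python
# Hypothetical actor/critic heads and the Adam settings quoted above.
import torch.nn as nn
import torch.optim as optim

actor_head = nn.Sequential(               # two fc layers -> 4 action logits
    nn.Linear(2 * 512, 100), nn.ReLU(),
    nn.Linear(100, 4),
)
critic_head = nn.Sequential(              # same shape, single Q-value output
    nn.Linear(2 * 512, 100), nn.ReLU(),
    nn.Linear(100, 1),
)
actor_opt = optim.Adam(actor_head.parameters(), lr=1e-4)
critic_opt = optim.Adam(critic_head.parameters(), lr=1e-4)
```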

4.3 Results and Analysis

Quantitative Evaluation: We conducted quantitative evaluations on the OTB-2015, Temple-Color, and VOT-2016 datasets.

OTB-2015 Dataset. We compared our approach with state-of-the-art trackers including CREST [43], ADNet [52], MDNet [36], HCFT [32], SINT [45], DeepSRDCF [12], and HDT [39]. Figure 4 shows the performance of different trackers in terms of precision and success rate based on center location error and overlap ratio on OTB-2015. We also evaluated the processing speed (fps) of the different tracking methods on the OTB-2015 dataset. Overall, our tracker performs favorably in both precision and success rate, while running at 10.2 fps, about 5 times faster than the state-of-the-art tracker MDNet (2.1 fps in our PyTorch implementation). One variant of our tracker with only two action types, shown later, runs even faster with an acceptable trade-off in accuracy.

Fig. 4.

The precision and success rate over all sequences by using one-pass evaluation on the OTB-2015 Dataset [49]. The legend includes the area-under-the-curve score and the average distance precision score at 20 pixels for each tracker

Fig. 5.

The success plots over three tracking challenges, including fast motion, deformation, scale variations, for all the compared trackers on OTB-2015

We also analyzed the performance of our tracker on three challenge attributes labeled for each sequence: fast motion, deformation, and scale variation. We computed the OPE with the distance precision metric under 8 main video attributes. As shown in Fig. 5, our tracker shows competitive results on all the attributes. Specifically, the effectiveness on deformation is attributed to updating the prediction network according to the learned policy to capture target appearance changes. For scale variation, our tracker also performs well, which demonstrates that the prediction network is robust to scale changes of the target object. Our tracker performs better on all three challenges than ADNet [52], which is also a deep reinforcement learning based tracker. The main reason is that our prediction network can be adjusted according to the action chosen by the policy (actor) network. Meanwhile, the actions \( stop \ \& \ ignore\) and \( stop \ \& \ update\) guide our tracker on whether to update the target feature, which avoids improper model updates in long-term tracking. We obtain similar performance under fast motion, where both MDNet [36] and our tracker benefit from the convolutional features and the re-detection process. However, MDNet [36] applies re-detection on a high percentage of frames, resulting in more computation.

Fig. 6.

Qualitative evaluation of our tracker, MDNet [36], ADNet [52] and CREST [43] on 7 challenging sequences

Fig. 7.

The precision and success plots over all sequences by using one-pass evaluation on the Temple-Color Dataset. The legend contains the average distance precision score and the area-under-the-curve score for each tracker

Temple-Color Dataset. We evaluated our approach on the Temple-Color dataset containing 128 videos. Figure 7 shows the performance of different trackers in terms of precision and success rate, based on center location error and overlap ratio. The C-COT tracker [13] and MEEM [54] reach average distance precision scores of 0.781 and 0.706, respectively. Our approach improves on these by a significant margin, achieving a score of 0.818. In the success plot in Fig. 7, our method also achieves a notable absolute gain of \(1.2\%\) in area-under-the-curve score over the state-of-the-art method C-COT.

VOT-2016 Dataset. Table 1 shows the comparison of our approach with the top 5 competing trackers in the VOT-2016 challenge. As shown in Table 1, we obtain accuracy and robustness rankings competitive with state-of-the-art methods on the VOT-2016 dataset. Our method achieves favorable accuracy while keeping a low failure rate, which we attribute to the decision making on motion estimation and tracking status guided by reinforcement learning. Note that MDNet_N is a variant of MDNet that does not pre-train CNNs on other tracking datasets; like our method, MDNet_N is initialized from ImageNet. Our DRL-IS improves on MDNet_N by a significant margin, which shows that our tracker generalizes well without using tracking sequences as training data.

Table 1. Comparison with state-of-the-art methods in terms of robustness and accuracy ranking on the VOT-2016 dataset (the lower the better)

Qualitative Evaluation: Figure 6 shows qualitative comparisons of top-performing visual tracking methods, including MDNet [36], ADNet [52], CREST [43] and our method, on 7 challenging sequences. Our tracker performs well against these compared methods in all sequences. Moreover, none of the other methods is able to track the target in the CarScale sequence, whereas our tracker successfully locates the target and estimates the scale changes. There are two reasons: (1) our method accounts for the appearance changes caused by deformation and background clutter (Bird1, Soccer and Freeman4) by adjusting the bounding box of the object iteratively; (2) the object features and the models are updated adaptively with deep reinforcement learning to account for appearance variations.

Table 2. The comparisons of different ablation variants of DRL-IS over the distance precision and overlap success plots on the OTB-2015 dataset

Ablation Study of Different Components: To show the impact of different components of our tracker, we developed three variants by combining the prediction network with different sets of actions and evaluated them on OTB-2015. The three variants are: (1) “Shift”, a baseline tracker which contains only the pre-trained prediction network; (2) “Shift + IS”, a pre-trained prediction model guided by only two action types, continue and \( stop \ \& \ update\); and (3) “DRL-IS”, our final model guided by the full set of actions: continue, restart, \( stop \ \& \ ignore\) and \( stop \ \& \ update\). Table 2 shows the distance precision and overlap success of these variants on the OTB-2015 dataset. The “Shift” tracker can only perform a one-step shift based on deep convolutional features, which does not perform well because the model is not updated during tracking and may fail when the target object changes quickly. The “Shift + IS” tracker enables iterative shift and updates the model according to the policy learned by the actor network, which outperforms the baseline tracker by \(6.5\%\) and \(5.7\%\) in precision and overlap success, respectively. Moreover, “DRL-IS” incorporates all actions with the prediction network and achieves \(8.7\%\) and \(2.2\%\) precision gains over the “Shift” and “Shift + IS” variants, respectively.

5 Conclusion

In this paper, we have proposed the DRL-IS method for visual tracking, which demonstrates that reinforcement learning is an effective way to model the tough decision-making process in tracking, i.e., performing motion estimation and changing the tracking status at the same time. The new iterative shift by deep networks locates targets more efficiently than online classification and copes well with cases where deformation or motion blur is present in video. Extensive experiments on 3 public datasets have validated the advantages of the proposed method in tracking robustness and efficiency.