
1 Introduction

Object tracking is the problem of estimating the spatio-temporal trajectory of a target object in an image sequence. Although it has been studied for many applications, such as bio-image analysis, scene surveillance, and autonomous vehicle control, it remains a difficult problem. One difficulty comes from appearance variation. For example, in a general person tracking problem, we need to deal with various clothes, poses, and body shapes under various illumination conditions. Traditional methods assume a predefined template of the target object and update it according to changes in appearance [1, 2]. Another difficulty is occlusion. Traditional object tracking methods are often intolerant to severe occlusion [3,4,5].

In this paper, we propose an object tracking method robust to both appearance variation and occlusion by using a complementary combination of the Single Shot Multibox Detector (SSD) [6], the Fully Convolutional Network (FCN) [7], and Dynamic Programming (DP) [8,9,10]. SSD and FCN are employed to tackle appearance variation. Both were recently proposed for object detection, and they provide a probability value of a target object for each category, such as person, car, and motorbike, given each bounding box or pixel, respectively. Since SSD and FCN are types of CNNs, large amounts of training samples make them robust to a variety of appearances.

To deal with occlusion, we utilize DP for global optimization of the target object's trajectory. DP is one of the most fundamental optimization techniques and has been used for obtaining a globally optimal tracking path. Since the slope constraint of DP prohibits the tracked position from moving steeply between frames, it is possible to obtain a stable tracking path regardless of occlusion.

It is important to note why we use the two CNN-based object detectors, SSD and FCN, in a complementary manner: they provide detection results in different ways. SSD provides an accurate detection result for a clearly visible target object, but it may fail to provide any detection result in an unstable situation such as occlusion. In contrast, FCN always provides a result, regardless of the situation. Namely, the combination of SSD and DP is used in stable situations to obtain accurate results, whereas the combination of FCN and DP is used in unstable situations to obtain at least some result.

It is also noteworthy that the proposed method requires neither the initial position nor a template of the target object. Traditional trackers may be sensitive to the template of the target object and to the initialization, in which the initial position of the target object is marked on the first frame. However, the proposed method requires neither the template nor the initialization, due to the synergy of combining SSD, FCN, and DP.

The contributions of this paper are as follows. First, we show that the proposed method achieves the highest accuracy compared to the traditional trackers introduced in the Visual Tracker Benchmark [11]. Second, we confirm that the complementary use of the two CNN-based object detectors, SSD and FCN, is useful for tracking. Third, we confirm through several experiments that the proposed method tackles appearance variation and occlusion, even without initialization, templates, or parameter modification.

The remainder of this paper is organized as follows. In Sect. 2, we introduce related tracking research. Section 3 elaborates on SSD, FCN, and DP and details the proposed method. In Sect. 4, we confirm through several experiments that the proposed method is a robust tracker and analyze the experimental results. Finally, Sect. 5 draws the conclusion.

2 Related Work

Object tracking is one of the important techniques in computer vision and has been actively studied for decades. Most object tracking algorithms fall into two categories: generative and discriminative methods. Generative methods describe the appearance of a target object using a generative model and search for the target object region that fits the model best. A number of generative-model-based algorithms have been proposed, such as sparse representation [12, 13], density estimation [14, 15], and incremental subspace learning [16]. On the contrary, discriminative methods build a model to distinguish a target object from the background. These tracking methods include P-N learning [17] and online boosting [18,19,20]. Even though these approaches are satisfactory in restricted situations, they have inherent limitations regarding occlusion and appearance variation such as illumination changes and deformation.

To deal with limitations which traditional trackers cannot tackle, recent trackers employ Convolutional Neural Networks (CNN) [21, 22] and Deep Convolutional Neural Networks (DCNN) [23, 24], exploiting their powerful performance. A number of trackers using neural networks have been proposed, for tasks such as human tracking and hand tracking [25,26,27,28]. A representative tracker using a neural network is the Fully Convolutional Network based Tracker (FCNT) [29], which also utilizes FCN. This method utilizes multi-level feature maps of a VGG network [30] to cope with drastic appearance variation and to distinguish a target object from similar distracters. It selects discriminative feature maps and discards noisy ones, because the CNN features pretrained on ImageNet [23] are for distinguishing generic objects. Even though FCNT achieved a high accuracy compared to conventional trackers, initialization and templates are necessary for it to track a target object.

Table 1 shows the comparison of characteristics between the proposed method and FCNT. The main difference is whether initialization and templates of a target object are necessary. Namely, FCNT can track only a specific target object with a given initial position and template, whereas the proposed method can be used without them. The other difference is that FCNT uses a greedy tracking algorithm, whereas the proposed method utilizes DP for globally optimal tracking. In Sect. 4.3, we demonstrate experimentally the superiority of the proposed method over FCNT.

Table 1. Comparison between the proposed method and FCNT [29].
Fig. 1. Pipeline to generate a likelihood map from an input image: (a) shows the pipeline to generate a likelihood map by SSD. The likelihood map by SSD provides accurate probability values and positions of target objects when the targets are relatively easy to detect. When SSD fails to detect the target, we switch to FCN and employ the likelihood map by FCN, as shown in (b). The likelihood map by FCN might include noisy probability values compared to that by SSD.

3 The Proposed Method

3.1 Likelihood Maps by Single Shot Multibox Detector and Fully Convolutional Network

Figure 1 shows the pipeline to obtain a likelihood map from an input image. In the proposed method, SSD and FCN are used to obtain likelihood maps, each of which shows a two-dimensional probability distribution of the target object position at a certain frame. A peak in the likelihood map at frame t suggests a candidate position of the target object at t. We switch between the two likelihood maps according to the situation, as shown in Fig. 1, because SSD and FCN show different behaviors, especially when object detection is difficult, as described below.

SSD is based on the VGG-16 network, which includes 13 convolution layers and 3 pooling layers. It possesses two supplementary characteristics: convolutional predictors and multi-scale feature maps. The convolutional predictors generate a probability value for the presence of each object category in each default box and produce adjustments to the box to match the object shape. Additionally, the network combines predictions from multi-scale feature maps with different resolutions to handle objects of various sizes. We generate a likelihood map by setting the probability value at the center position of each resulting bounding box. Thus, likelihood maps obtained by SSD contain very accurate probability values. However, when SSD fails to detect a target object, no likelihood map can be obtained.
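The construction of an SSD-based likelihood map can be sketched as follows. This is a minimal illustration, not the authors' implementation; we assume detections arrive as (x_min, y_min, x_max, y_max, score) tuples and that the map matches the image size.

```python
import numpy as np

def likelihood_map_from_ssd(detections, height, width):
    """Build a likelihood map by placing each detection's confidence
    at the center of its bounding box; all other pixels stay zero.

    detections: list of (x_min, y_min, x_max, y_max, score) tuples.
    """
    lmap = np.zeros((height, width))
    for x_min, y_min, x_max, y_max, score in detections:
        cx = int((x_min + x_max) / 2)
        cy = int((y_min + y_max) / 2)
        lmap[cy, cx] = max(lmap[cy, cx], score)  # keep the strongest detection
    return lmap

# Example: a single detection with confidence 0.9, centered at column 20, row 40
lmap = likelihood_map_from_ssd([(10, 20, 30, 60, 0.9)], 100, 100)
```

When SSD returns no detections, this map is all zeros, which is precisely the failure case handled by the FCN fallback described below.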

FCN is composed entirely of convolutional layers based on VGG-16 and is trained end-to-end, pixels-to-pixels, for classification and segmentation. It takes an input of arbitrary size and produces a correspondingly-sized likelihood map via downsampling pooling layers and upsampling layers. To obtain accurate positive responses for a target object, links between the low-level fine layers and the high-level coarse layers are constructed. These so-called skip connections combine information from fine and coarse layers. Even though skip connections yield a more accurate likelihood map, a likelihood map obtained by FCN still includes noisy probability values compared to SSD.
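The skip-connection fusion can be illustrated with a simplified sketch: a coarse score map is upsampled and added element-wise to a finer one. Note this uses nearest-neighbour upsampling for brevity, whereas FCN itself uses learned (bilinear-initialized) deconvolution; the function name is ours.

```python
import numpy as np

def fuse_skip(coarse, fine):
    """Fuse a coarse (low-resolution) score map with a finer one by
    upsampling the coarse map and adding the two element-wise, in the
    spirit of FCN skip connections (simplified sketch)."""
    fy = fine.shape[0] // coarse.shape[0]
    fx = fine.shape[1] // coarse.shape[1]
    up = np.kron(coarse, np.ones((fy, fx)))  # nearest-neighbour upsample
    return up + fine

# Example: fuse a 2x2 coarse map into a 4x4 fine map of zeros
fused = fuse_skip(np.array([[1.0, 2.0], [3.0, 4.0]]), np.zeros((4, 4)))
```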

Fig. 2. The tracking path optimization in the proposed method: each frame is a likelihood map calculated by SSD or FCN, and the green arrows indicate the probability value of the target object at each position. The blue rectangle is the slope constraint that restricts movement, and the red plot is the position where the sum of probability values over the associated path is highest at the final frame. The orange arrows indicate the globally optimal tracking path obtained by back-tracking. (Color figure online)

Using both SSD and FCN to obtain likelihood maps increases the tracking accuracy. Although SSD is a detection method with high accuracy, it might not detect the target object in unstable situations involving occlusion, blurriness, or deformation, as shown in (b) of Fig. 1. If SSD fails to detect, we switch to FCN and obtain likelihood maps by FCN. The criterion for SSD detection success is whether the detected position lies within N pixels of the highest-value position of the previous frame. FCN provides likelihood maps for all input images regardless of unstable situations, even if they might include noisy probability values. Note that, as discussed later, even when both SSD and FCN cannot obtain likelihood values (e.g., when the tracked object leaves the scene), DP complements the tracking path.
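The switching rule described above can be sketched as follows; the function name and map representation are hypothetical, and we assume the peak of each likelihood map serves as the detected position.

```python
import numpy as np

def select_likelihood_map(ssd_map, fcn_map, prev_peak, n_pixels=10):
    """Keep the SSD likelihood map when its detected position lies
    within n_pixels of the previous frame's peak; otherwise fall
    back to the (possibly noisier) FCN likelihood map."""
    if ssd_map.max() > 0:  # SSD produced at least one detection
        peak = np.unravel_index(np.argmax(ssd_map), ssd_map.shape)
        dist = np.hypot(peak[0] - prev_peak[0], peak[1] - prev_peak[1])
        if dist <= n_pixels:
            return ssd_map
    return fcn_map  # unstable situation: rely on FCN

# Example: an SSD peak close to the previous peak keeps the SSD map
ssd = np.zeros((50, 50)); ssd[5, 5] = 0.9
fcn = np.full((50, 50), 0.1)
chosen = select_likelihood_map(ssd, fcn, prev_peak=(4, 4))
```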

The other merit of using SSD and FCN is their computational efficiency. A naive method for obtaining likelihood maps is to apply a CNN to each sliding-window region of an input image. This requires many forward calculations and cannot deal with target objects of various sizes, because the CNN accepts a fixed region size as input. In contrast, both SSD and FCN accept the entire image, require only a single forward calculation, and handle various object sizes.

3.2 Global Path Optimization by Dynamic Programming

To apply DP to our method, we start by creating likelihood maps for each frame using SSD or FCN. Figure 2 shows the procedure to obtain the optimal tracking path by DP. For each pixel on a likelihood map, we find the highest value within a given slope constraint on the previous frame, which prohibits steep movement, and create cumulative DP maps. This process is iterated over all frames. The cumulative DP map \(D^{(f)}\) is defined as:

$$\begin{aligned} D^{(f)}(x,y) = \max _{x-w_s \le x' \le x+w_s,\, y-h_s \le y' \le y+h_s} \left[ D^{(f-1)}(x',y') \right] + L^{(f)}(x,y) \end{aligned}$$
(1)

where \(L^{(f)}\) is the likelihood map, f is the frame index, and the size of the slope constraint is denoted as (\(w_s\), \(h_s\)). We select the highest probability value on the final cumulative DP map. After that, DP recovers the optimal tracking path by back-tracking through the maximizing positions on the preceding frames. DP is a non-greedy algorithm that estimates the globally optimal path over a sequence. Due to this, DP-based tracking is robust to occlusion, which degrades the tracking performance of greedy algorithms.
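The cumulative-map recurrence and back-tracking step can be sketched as follows. This is a straightforward, deliberately unoptimized reference implementation under the assumption that likelihood maps are given as NumPy arrays; it is not the authors' code.

```python
import numpy as np

def dp_track(likelihood_maps, ws=2, hs=2):
    """Find the globally optimal path through per-frame likelihood maps.

    Implements D[f](x, y) = max over the slope-constraint window of
    D[f-1], plus L[f](x, y), then back-tracks from the best position
    on the final cumulative map."""
    F = len(likelihood_maps)
    H, W = likelihood_maps[0].shape
    D = likelihood_maps[0].copy()
    back = []  # per frame: best predecessor position for each pixel
    for f in range(1, F):
        new_D = np.empty((H, W))
        ptr = np.empty((H, W, 2), dtype=int)
        for y in range(H):
            for x in range(W):
                y0, y1 = max(0, y - hs), min(H, y + hs + 1)
                x0, x1 = max(0, x - ws), min(W, x + ws + 1)
                win = D[y0:y1, x0:x1]  # slope-constraint window
                iy, ix = np.unravel_index(np.argmax(win), win.shape)
                new_D[y, x] = win[iy, ix] + likelihood_maps[f][y, x]
                ptr[y, x] = (y0 + iy, x0 + ix)
        back.append(ptr)
        D = new_D
    # back-track from the highest value on the final cumulative DP map
    y, x = np.unravel_index(np.argmax(D), D.shape)
    path = [(y, x)]
    for ptr in reversed(back):
        y, x = ptr[y, x]
        path.append((y, x))
    return path[::-1]  # one (row, col) position per frame

# Toy example: a target moving diagonally one pixel per frame
maps = [np.zeros((10, 10)) for _ in range(3)]
maps[0][2, 2] = maps[1][3, 3] = maps[2][4, 4] = 1.0
path = dp_track(maps)
```

Because the path is chosen only after all cumulative maps are built, a frame with a missing or noisy peak (e.g., under occlusion) is bridged by the surrounding frames rather than derailing the tracker greedily.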

3.3 Synergies by Combining SSD, FCN, and DP

We propose the combination of SSD, FCN, and DP as a robust object tracking method. The proposed method is not only robust to appearance variation and occlusion, but also requires no template, no initialization, and no parameter changes, even when the appearance of the target object changes.

The proposed method does not need a template for object tracking. For traditional trackers, a template is necessary and needs to be updated when the target object changes. For the proposed method, it is unnecessary, under the condition that the category of the target object has been trained sufficiently. Since the proposed method deals with appearance variation by learning numerous features of the target category, it also does not need parameter modification when the appearance of the target object changes. For traditional trackers, identifying the position of the target object on the first frame is an important element of tracking. The proposed method, however, obtains the globally optimal tracking path by back-tracking over all cumulative DP maps without any such identification.

4 Implementation and Experiments

4.1 Experimental Setup

We used the VOC2012 [31] dataset to train SSD and FCN on 20 categories. The training dataset has 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations. The categories are as follows: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa and tv/monitor.

To demonstrate that the proposed method can track a target object with high accuracy, we evaluated it using sequences whose target object belongs to one of the 20 categories of VOC2012. Since the proposed method can detect only objects of the 20 trained categories, we selected 12 sequences for our experiments: CarScale, Coke, Couple, Crossing, David3, Jogging1&2, MotorRolling, MountainBike, Walking1&2, Woman. It is noteworthy that these sequences exhibit various difficulties, such as illumination variation, scale variation, occlusion, fast motion, rotation, and low resolution.

The sequences were classified into two types, single-object sequencesFootnote 1 and multi-object sequencesFootnote 2. Since the proposed method is designed to track a single object without initialization, single-object sequences are appropriate for performance evaluation. The proposed method, however, is still applicable to multi-object sequences with initialization. We therefore conducted two separate experiments: single-object sequences (without initialization) and multi-object sequences (with initialization). Moreover, we conducted two extra experiments. One compares the results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN. In this experiment, we set N, the pixel threshold of the criterion for switching from SSD to FCN, to 10. The other compares the proposed method and FCNT [29], a tracker using FCN, to demonstrate the superiority of the proposed method over FCNT.

4.2 Evaluation Criterion

We evaluated the proposed method by comparing precision, an established metric in the Visual Tracker Benchmark [11]. The precision is defined as the percentage of frames whose estimated position is within a given threshold of the ground truth. The distance between the estimated position and the manually labeled ground truth is the Euclidean distance. To show performance efficiently, we conducted one-pass evaluation (OPE), in which each tracker runs through a test sequence only once, and compared the precision of the trackers.
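The precision metric can be computed as in the following sketch, with positions given as (x, y) pixel coordinates; the function name is ours.

```python
import numpy as np

def precision_at(estimates, ground_truth, threshold=20):
    """Percentage of frames whose estimated position lies within
    `threshold` pixels (Euclidean distance) of the ground truth."""
    est = np.asarray(estimates, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    dists = np.linalg.norm(est - gt, axis=1)  # per-frame center error
    return float(np.mean(dists <= threshold))

# Example: 3 of 4 estimated positions fall within 20 pixels of the truth
p = precision_at([(0, 0), (5, 5), (100, 100), (10, 0)],
                 [(0, 0), (0, 0), (0, 0), (0, 0)])
```

Sweeping `threshold` over a range of pixel values produces the precision curves shown in the figures below, with the legend score read off at threshold 20.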

Fig. 3. Precision results for single-object sequences: we compared results of the proposed method without initialization against those of the traditional trackers with initialization, using single-object sequences. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels. The proposed method shows higher performance at all thresholds and no large deviation from the ground truth.

Fig. 4. Precision results for multi-object sequences: we summarize all results with initialization, using multi-object sequences. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels.

Fig. 5. Precision results according to how the likelihood map is generated: we compared results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN, without initialization. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels. The proposed method, which uses likelihood maps by both SSD and FCN, shows higher performance than the others.

4.3 Evaluation Results

Tracking methods can be divided into offline tracking, such as the proposed method, and online tracking. We nevertheless compared the proposed method to online tracking methods, because no comparable offline tracking methods have been released. We compared the proposed method to the top five traditional trackers introduced in the Visual Tracker Benchmark: Structured Output Tracking with Kernels (Struck) [32], a sparsity-based tracker (SCM) [33], the P-N Learning tracker (TLD) [34], the Context tracker (CXT) [35], and Visual Tracking Decomposition (VTD) [36].

Figure 3 shows all precision results, including the proposed method without initialization and the traditional trackers with initialization. The score listed in the legend of Fig. 3 is the precision at a threshold of 20 pixels, the standard threshold of the Visual Tracker Benchmark. As shown in Fig. 3, we confirmed that the proposed method outperforms the traditional trackers, even though it is not given the initial position of the object on the first frame. Since DP sets a slope constraint that prohibits the tracked position from moving rapidly, the proposed method can track a target object with small deviation.

As mentioned in Sect. 4.1, the proposed method is applicable to multi-object sequences by identifying an initial position of the target object. Figure 4 shows all results with initialization. For all thresholds, the proposed method shows higher performance than the traditional trackers. These results show that, even when multiple objects exist in the same frame, the proposed method can distinguish and track a target object given initialization. By comparing the results of Figs. 3 and 4, we also confirm that the precision of the proposed method with initialization is higher than that without initialization.

Figure 5 shows the precision results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN, respectively, without initialization. The proposed method, which utilizes both SSD and FCN, shows higher performance than the others at a threshold of 20 pixels. Since the method using single SSD cannot generate likelihood maps for all input images, its results are worse than those of the proposed method. As shown in Fig. 5, the precision by single FCN stays low for small thresholds and ascends rapidly only at larger thresholds. When the target object is large, the method using single FCN might track a position far from the center of the target object, because the highest probability value obtained by FCN is not always close to the center of the target. Due to this, the precision by single FCN is lower than the others for low thresholds.

Fig. 6. Example results of the proposed method dealing with appearance variation and occlusion: the green and red '+' marks indicate the ground truth and the tracked position, respectively. The examples on the left are results of the top five traditional trackers introduced in the Visual Tracker Benchmark; the examples on the right are results of the proposed method. (Color figure online)

Figure 6 shows examples of the proposed method dealing with appearance variation and occlusion. As shown in Fig. 6, although the sequences contain various target objects of the same category, the proposed method can track each target object without template or parameter modification. We also confirmed that the proposed method can track an occluded target object more stably than the top five traditional trackers, because DP seeks the globally optimal path over all frames. However, when target objects of the same category appear under occlusion, as in the final sequence of Fig. 6, the proposed method does not know which object it should track. Since the proposed method relies on general features of the category during tracking, specific features of the target object would need to be added to track it distinguishably.

We also observed the performance of FCNT [29] on the same sequences. It achieved 0.951 and 0.945 precision for single-object and multi-object sequences, respectively, at a threshold of 20 pixels. It is, however, hardly meaningful to compare this precision to ours. First of all, FCNT needs a template and ours does not. In addition, FCNT needs initialization and ours does not. In fact, the comparison between Figs. 3 and 4 shows that the proposed method suffers no severe degradation without initialization. Furthermore, our DP-based method has theoretical superiority over FCNT in robustness to occlusion, as shown in Fig. 7.

Fig. 7. Example results of FCNT and the proposed method when tracking an occluded target object.

5 Conclusion

In this paper, we presented an object tracking method which combines SSD, FCN, and DP. Through several experiments, we confirmed that the proposed method is robust to appearance variation and occlusion and achieves the highest accuracy compared to the traditional trackers in the Visual Tracker Benchmark. In contrast to traditional trackers, the proposed method can track a target object without initialization, parameter modification, or templates, owing to the synergy of combining SSD, FCN, and DP. Moreover, the proposed method can be extended to tracking among multiple similar objects by using initialization. We also confirmed that using both SSD and FCN yields more stable tracking than single SSD or single FCN.

We expect the proposed method to be useful in analysis fields such as traffic analysis and bio-image analysis. In the future, we will connect SSD and FCN with network flows [37] to track multiple target objects simultaneously. To increase the tracking accuracy in situations where similar objects of the same category appear with occlusion, such as the last sequence in Fig. 6, we will apply FlowNet [38] in order to utilize optical flow information.