
1 Introduction

Object tracking is the problem of estimating the spatio-temporal trajectory of a target object in an image sequence. Although it has been studied for many applications, such as bio-image analysis, scene surveillance, and autonomous vehicle control, it remains a difficult problem. One difficulty comes from appearance variation. For example, in a general person tracking problem, we need to deal with various clothes, poses, and body shapes under various illumination conditions. Traditional methods assume a predefined template of the target object and update it according to changes in appearance [1, 2]. Another difficulty is occlusion. Traditional object tracking methods are often intolerant to severe occlusion [3,4,5].

In this paper, we propose an object tracking method robust to both appearance variation and occlusion by using a complementary combination of the Single Shot Multibox Detector (SSD) [6], the Fully Convolutional Network (FCN) [7], and Dynamic Programming (DP) [8,9,10]. SSD and FCN are employed to tackle appearance variation. Both were recently proposed for object detection, and they provide a probability value of a target object for each category, such as person, car, and motorbike, given each bounding box or pixel, respectively. Since SSD and FCN are types of CNNs, large amounts of training samples make them robust to a variety of appearances.

To deal with occlusion, we utilize DP for global optimization of the target object's trajectory. DP is one of the most fundamental optimization techniques and has been used for obtaining a globally optimal tracking path. Since the slope constraint of DP prohibits the tracked position from moving steeply between frames, it is possible to obtain a stable tracking path regardless of occlusion.

It is important to note why we use the two CNN-based object detectors, SSD and FCN, in a complementary manner: they provide detection results in different ways. SSD provides an accurate detection result for a clearly visible target object, but it may fail to provide any detection result in an unstable situation such as occlusion. In contrast, FCN always provides a result, regardless of the situation. Namely, the combination of SSD and DP is used in stable situations to obtain accurate results, whereas the combination of FCN and DP is used in unstable situations to obtain at least some result.

It is also noteworthy that the proposed method requires neither the initial position nor a template of the target object. Traditional trackers may be sensitive to the template of the target object and to the initialization, in which the initial position of the target object is marked on the first frame. However, the proposed method requires neither the template nor the initialization, due to the synergy of combining SSD, FCN, and DP.

The contributions of this paper are as follows. First, we show that the proposed method achieves the highest accuracy compared to the traditional trackers introduced in the Visual Tracker Benchmark [11]. Second, we confirm that the complementary use of the two CNN-based object detectors, SSD and FCN, is useful for tracking. Third, we confirm through several experiments that the proposed method tackles appearance variation and occlusion, even without initialization, templates, or parameter modification.

The remainder of this paper is organized as follows. In Sect. 2, we introduce related tracking research. Section 3 elaborates on SSD, FCN, and DP and details the proposed method. In Sect. 4, we confirm through several experiments that the proposed method is a robust tracker and analyze the experimental results. Finally, Sect. 5 draws the conclusion.

2 Related Work

Object tracking is one of the important techniques in computer vision and has been actively studied for decades. Most object tracking algorithms fall into two categories: generative and discriminative methods. Generative methods describe the appearance of a target object using a generative model and search for the target object region that fits the model best. A number of generative-model-based algorithms have been proposed, such as sparse representation [12, 13], density estimation [14, 15], and incremental subspace learning [16]. On the contrary, discriminative methods build a model to distinguish a target object from the background. These tracking methods include P-N learning [17] and online boosting [18,19,20]. Even though these approaches are satisfactory in restricted situations, they have inherent limitations regarding occlusion and appearance variation such as illumination changes and deformation.

To deal with limitations which traditional trackers cannot tackle, recent trackers employ Convolutional Neural Networks (CNN) [21, 22] and Deep Convolutional Neural Networks (DCNN) [23, 24], exploiting their powerful performance. A number of trackers using neural networks have been proposed, for tasks such as human tracking and hand tracking [25,26,27,28]. A representative tracker using a neural network is the Fully Convolutional Network based Tracker (FCNT) [29], which also utilizes FCN. This method utilizes multi-level feature maps of a VGG network [30] to cope with drastic appearance variation and to distinguish a target object from similar distracters. It selects discriminative feature maps and discards noisy ones, because the CNN features pretrained on ImageNet [23] are for distinguishing generic objects. Even though FCNT achieved a high accuracy compared to conventional trackers, initialization and templates are necessary for it to track a target object.

Table 1 shows the comparison of characteristics between the proposed method and FCNT. The main difference is whether initialization and templates of a target object are necessary. Namely, FCNT can track only a specific target object with a given initial position and template, whereas the proposed method can be used without them. The other difference is that FCNT uses a greedy tracking algorithm, whereas the proposed method utilizes DP for globally optimal tracking. In Sect. 4.3, we demonstrate experimentally the superiority of the proposed method over FCNT.

Table 1. Comparison between the proposed method and FCNT [29].
Fig. 1. Pipeline to generate a likelihood map from an input image: (a) shows the pipeline to generate a likelihood map by SSD. The likelihood map by SSD provides accurate probability values and positions of target objects when the targets are relatively easy to detect. When SSD fails to detect the target, we switch to FCN and employ the likelihood map by FCN, as shown in (b). The likelihood map by FCN might include noisy probability values compared to that by SSD.

3 The Proposed Method

3.1 Likelihood Maps by Single Shot Multibox Detector and Fully Convolutional Network

Figure 1 shows the pipeline to obtain a likelihood map from an input image. In the proposed method, SSD and FCN are used to obtain likelihood maps, each of which shows a two-dimensional probability distribution of the target object position at a certain frame. A peak in the likelihood map at frame t suggests a candidate position of the target object at t. We switch between the two likelihood maps according to the situation, as shown in Fig. 1, because SSD and FCN show different behaviors, especially when object detection is difficult, as described below.

SSD is based on the VGG-16 network, which includes 13 convolution layers and 3 pooling layers. It possesses two supplementary characteristics: convolutional predictors and multi-scale feature maps. The convolutional predictors generate a probability value for the presence of each object category in each default box and produce adjustments to the box to match the object shape. Additionally, the network combines predictions from multi-scale feature maps with different resolutions to handle objects of various sizes. We generate a likelihood map by setting the probability value at the center position of each resulting bounding box. Thus, likelihood maps obtained by SSD contain very accurate probability values. However, when SSD fails to detect a target object, no likelihood map can be obtained.
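The construction of an SSD-based likelihood map can be sketched as follows. This is a minimal illustration, not the authors' implementation; we assume detections arrive as (x_min, y_min, x_max, y_max, score) tuples and that the map matches the image size.

```python
import numpy as np

def likelihood_map_from_ssd(detections, height, width):
    """Build a likelihood map by placing each detection's confidence
    at the center of its bounding box; all other pixels stay zero.

    detections: list of (x_min, y_min, x_max, y_max, score) tuples.
    """
    lmap = np.zeros((height, width))
    for x_min, y_min, x_max, y_max, score in detections:
        cx = int((x_min + x_max) / 2)
        cy = int((y_min + y_max) / 2)
        lmap[cy, cx] = max(lmap[cy, cx], score)  # keep the strongest detection
    return lmap

# Example: a single detection with confidence 0.9, centered at column 20, row 40
lmap = likelihood_map_from_ssd([(10, 20, 30, 60, 0.9)], 100, 100)
```

When SSD returns no detections, this map is all zeros, which is precisely the failure case handled by the FCN fallback described below.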

FCN is composed entirely of convolutional layers based on VGG-16 and is trained end-to-end, pixels-to-pixels, for classification and segmentation. It takes an input of arbitrary size and produces a correspondingly-sized likelihood map via downsampling pooling layers and upsampling layers. To obtain accurate positive responses for a target object, links between the low-level fine layers and the high-level coarse layers are constructed. These so-called skip connections combine information from fine and coarse layers. Even though skip connections yield a more accurate likelihood map, a likelihood map obtained by FCN still includes noisy probability values compared to SSD.
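The skip-connection fusion can be illustrated with a simplified sketch: a coarse score map is upsampled and added element-wise to a finer one. Note this uses nearest-neighbour upsampling for brevity, whereas FCN itself uses learned (bilinear-initialized) deconvolution; the function name is ours.

```python
import numpy as np

def fuse_skip(coarse, fine):
    """Fuse a coarse (low-resolution) score map with a finer one by
    upsampling the coarse map and adding the two element-wise, in the
    spirit of FCN skip connections (simplified sketch)."""
    fy = fine.shape[0] // coarse.shape[0]
    fx = fine.shape[1] // coarse.shape[1]
    up = np.kron(coarse, np.ones((fy, fx)))  # nearest-neighbour upsample
    return up + fine

# Example: fuse a 2x2 coarse map into a 4x4 fine map of zeros
fused = fuse_skip(np.array([[1.0, 2.0], [3.0, 4.0]]), np.zeros((4, 4)))
```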

Fig. 2. The tracking path optimization in the proposed method: each frame is a likelihood map calculated by SSD or FCN, and the green arrows indicate the probability value of the target object at each position. The blue rectangle is the slope constraint that restricts movement, and the red plot is the position where the sum of probability values over the associated path is highest at the final frame. The orange arrows indicate the globally optimal tracking path obtained by back-tracking. (Color figure online)

Using both SSD and FCN to obtain likelihood maps increases the tracking accuracy. Although SSD is a detection method with high accuracy, it might not detect the target object in unstable situations involving occlusion, blurriness, or deformation, as shown in (b) of Fig. 1. If SSD fails to detect, we switch to FCN and obtain likelihood maps by FCN. The criterion for SSD detection success is whether the detected position lies within N pixels of the highest-value position of the previous frame. FCN provides likelihood maps for all input images regardless of unstable situations, even if they might include noisy probability values. Note that, as discussed later, even when both SSD and FCN cannot obtain likelihood values (e.g., when the tracked object leaves the scene), DP complements the tracking path.
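The switching rule described above can be sketched as follows; the function name and map representation are hypothetical, and we assume the peak of each likelihood map serves as the detected position.

```python
import numpy as np

def select_likelihood_map(ssd_map, fcn_map, prev_peak, n_pixels=10):
    """Keep the SSD likelihood map when its detected position lies
    within n_pixels of the previous frame's peak; otherwise fall
    back to the (possibly noisier) FCN likelihood map."""
    if ssd_map.max() > 0:  # SSD produced at least one detection
        peak = np.unravel_index(np.argmax(ssd_map), ssd_map.shape)
        dist = np.hypot(peak[0] - prev_peak[0], peak[1] - prev_peak[1])
        if dist <= n_pixels:
            return ssd_map
    return fcn_map  # unstable situation: rely on FCN

# Example: an SSD peak close to the previous peak keeps the SSD map
ssd = np.zeros((50, 50)); ssd[5, 5] = 0.9
fcn = np.full((50, 50), 0.1)
chosen = select_likelihood_map(ssd, fcn, prev_peak=(4, 4))
```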

The other merit of using SSD and FCN is their computational efficiency. A naive method for obtaining likelihood maps is to apply a CNN to each sliding-window region of an input image. This requires many forward calculations and cannot deal with target objects of various sizes, because the CNN accepts a fixed region size as input. In contrast, both SSD and FCN accept the entire image, require only a single forward calculation, and handle various object sizes.

3.2 Global Path Optimization by Dynamic Programming

To apply DP to our method, we start by creating likelihood maps for each frame using SSD or FCN. Figure 2 shows the procedure to obtain the optimal tracking path by DP. For each pixel on a likelihood map, we find the highest value within a given slope constraint on the previous frame, which prohibits steep movement, and create cumulative DP maps. This process is iterated over all frames. The cumulative DP map \(D^{(f)}\) is defined as:

$$\begin{aligned} D^{(f)}(x,y) = \max _{x-w_s \le x' \le x+w_s,\, y-h_s \le y' \le y+h_s} \left[ D^{(f-1)}(x',y') \right] + L^{(f)}(x,y) \end{aligned}$$
(1)

where \(L^{(f)}\) is the likelihood map, f is the frame index, and the size of the slope constraint is denoted as (\(w_s\), \(h_s\)). We select the highest probability value on the final cumulative DP map. After that, DP recovers the optimal tracking path by back-tracking through the maximizing positions on the preceding frames. DP is a non-greedy algorithm that estimates the globally optimal path over a sequence. Due to this, DP-based tracking is robust to occlusion, which degrades the tracking performance of greedy algorithms.
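The cumulative-map recurrence and back-tracking step can be sketched as follows. This is a straightforward, deliberately unoptimized reference implementation under the assumption that likelihood maps are given as NumPy arrays; it is not the authors' code.

```python
import numpy as np

def dp_track(likelihood_maps, ws=2, hs=2):
    """Find the globally optimal path through per-frame likelihood maps.

    Implements D[f](x, y) = max over the slope-constraint window of
    D[f-1], plus L[f](x, y), then back-tracks from the best position
    on the final cumulative map."""
    F = len(likelihood_maps)
    H, W = likelihood_maps[0].shape
    D = likelihood_maps[0].copy()
    back = []  # per frame: best predecessor position for each pixel
    for f in range(1, F):
        new_D = np.empty((H, W))
        ptr = np.empty((H, W, 2), dtype=int)
        for y in range(H):
            for x in range(W):
                y0, y1 = max(0, y - hs), min(H, y + hs + 1)
                x0, x1 = max(0, x - ws), min(W, x + ws + 1)
                win = D[y0:y1, x0:x1]  # slope-constraint window
                iy, ix = np.unravel_index(np.argmax(win), win.shape)
                new_D[y, x] = win[iy, ix] + likelihood_maps[f][y, x]
                ptr[y, x] = (y0 + iy, x0 + ix)
        back.append(ptr)
        D = new_D
    # back-track from the highest value on the final cumulative DP map
    y, x = np.unravel_index(np.argmax(D), D.shape)
    path = [(y, x)]
    for ptr in reversed(back):
        y, x = ptr[y, x]
        path.append((y, x))
    return path[::-1]  # one (row, col) position per frame

# Toy example: a target moving diagonally one pixel per frame
maps = [np.zeros((10, 10)) for _ in range(3)]
maps[0][2, 2] = maps[1][3, 3] = maps[2][4, 4] = 1.0
path = dp_track(maps)
```

Because the path is chosen only after all cumulative maps are built, a frame with a missing or noisy peak (e.g., under occlusion) is bridged by the surrounding frames rather than derailing the tracker greedily.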

3.3 Synergies by Combining SSD, FCN, and DP

We propose the combination of SSD, FCN, and DP as a robust object tracking method. The proposed method is not only robust to appearance variation and occlusion, but also requires no template, no initialization, and no parameter changes, even when the appearance of the target object changes.

The proposed method does not need a template for object tracking. For traditional trackers, a template is necessary and needs to be updated when the target object changes. For the proposed method, it is unnecessary, under the condition that the category of the target object has been trained sufficiently. Since the proposed method deals with appearance variation by learning numerous features of the target category, it also does not need parameter modification when the appearance of the target object changes. For traditional trackers, identifying the position of the target object on the first frame is an important element of tracking. The proposed method, however, obtains the globally optimal tracking path by back-tracking over all cumulative DP maps without any such identification.

4 Implementation and Experiments

4.1 Experimental Setup

We used the VOC2012 [31] dataset to train SSD and FCN on 20 categories. The training dataset has 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations. The categories are as follows: person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa and tv/monitor.

To demonstrate that the proposed method can track a target object with high accuracy, we evaluated it using sequences whose target object belongs to one of the 20 categories of VOC2012. Since the proposed method can detect only objects of the 20 trained categories, we selected 12 sequences for our experiments: CarScale, Coke, Couple, Crossing, David3, Jogging1&2, MotorRolling, MountainBike, Walking1&2, Woman. It is noteworthy that these sequences exhibit various difficulties, such as illumination variation, scale variation, occlusion, fast motion, rotation, and low resolution.

The sequences were classified into two types, single-object sequencesFootnote 1 and multi-object sequencesFootnote 2. Since the proposed method is designed to track a single object without initialization, single-object sequences are appropriate for performance evaluation. The proposed method, however, is still applicable to multi-object sequences with initialization. We therefore conducted two separate experiments: single-object sequences (without initialization) and multi-object sequences (with initialization). Moreover, we conducted two extra experiments. One compares the results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN. In this experiment, we set N, the pixel threshold of the criterion for switching from SSD to FCN, to 10. The other compares the proposed method and FCNT [29], a tracker using FCN, to demonstrate the superiority of the proposed method over FCNT.

4.2 Evaluation Criterion

We evaluated the proposed method by comparing precision, an established metric in the Visual Tracker Benchmark [11]. The precision is defined as the percentage of frames whose estimated position is within a given threshold of the ground truth. The distance between the estimated position and the manually labeled ground truth is the Euclidean distance. To show performance efficiently, we conducted one-pass evaluation (OPE), in which each tracker runs through a test sequence only once, and compared the precision of the trackers.
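The precision metric can be computed as in the following sketch, with positions given as (x, y) pixel coordinates; the function name is ours.

```python
import numpy as np

def precision_at(estimates, ground_truth, threshold=20):
    """Percentage of frames whose estimated position lies within
    `threshold` pixels (Euclidean distance) of the ground truth."""
    est = np.asarray(estimates, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    dists = np.linalg.norm(est - gt, axis=1)  # per-frame center error
    return float(np.mean(dists <= threshold))

# Example: 3 of 4 estimated positions fall within 20 pixels of the truth
p = precision_at([(0, 0), (5, 5), (100, 100), (10, 0)],
                 [(0, 0), (0, 0), (0, 0), (0, 0)])
```

Sweeping `threshold` over a range of pixel values produces the precision curves shown in the figures below, with the legend score read off at threshold 20.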

Fig. 3. Precision results for single-object sequences: we compared results of the proposed method without initialization against those of the traditional trackers with initialization, using single-object sequences. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels. The proposed method shows higher performance at all thresholds and no large deviation from the ground truth.

Fig. 4. Precision results for multi-object sequences: we summarize all results with initialization, using multi-object sequences. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels.

Fig. 5. Precision results according to how the likelihood map is generated: we compared results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN, without initialization. The precision is defined as the percentage of frames whose estimated position is within the location error threshold, i.e., the permissible distance from the ground truth. The score listed in the legend is the precision at a threshold of 20 pixels. The proposed method, which uses likelihood maps by both SSD and FCN, shows higher performance than the others.

4.3 Evaluation Results

Tracking methods can be divided into offline tracking, such as the proposed method, and online tracking. We nevertheless compared the proposed method to online tracking methods, because no comparable offline tracking methods have been released. We compared the proposed method to the top five traditional trackers introduced in the Visual Tracker Benchmark: Structured Output Tracking with Kernels (Struck) [32], a sparsity-based tracker (SCM) [33], the P-N Learning tracker (TLD) [34], the Context tracker (CXT) [35], and Visual Tracking Decomposition (VTD) [36].

Figure 3 shows all precision results, including the proposed method without initialization and the traditional trackers with initialization. The score listed in the legend of Fig. 3 is the precision at a threshold of 20 pixels, the standard threshold of the Visual Tracker Benchmark. As shown in Fig. 3, we confirmed that the proposed method outperforms the traditional trackers, even though it is not given the initial position of the object on the first frame. Since DP sets a slope constraint that prohibits the tracked position from moving rapidly, the proposed method can track a target object with small deviation.

As mentioned in Sect. 4.1, the proposed method is applicable to multi-object sequences by identifying an initial position of the target object. Figure 4 shows all results with initialization. For all thresholds, the proposed method shows higher performance than the traditional trackers. These results show that, even when multiple objects exist in the same frame, the proposed method can distinguish and track a target object given initialization. By comparing the results of Figs. 3 and 4, we also confirm that the precision of the proposed method with initialization is higher than that without initialization.

Figure 5 shows the precision results using likelihood maps by both SSD and FCN, by single SSD, and by single FCN, respectively, without initialization. The proposed method, which utilizes both SSD and FCN, shows higher performance than the others at a threshold of 20 pixels. Since the method using single SSD cannot generate likelihood maps for all input images, its results are worse than those of the proposed method. As shown in Fig. 5, the precision by single FCN stays low for small thresholds and ascends rapidly only at larger thresholds. When the target object is large, the method using single FCN might track a position far from the center of the target object, because the highest probability value obtained by FCN is not always close to the center of the target. Due to this, the precision by single FCN is lower than the others for low thresholds.

Fig. 6. Example results of the proposed method dealing with appearance variation and occlusion: the green and red '+' marks indicate the ground truth and the tracked position, respectively. The examples on the left are results of the top five traditional trackers introduced in the Visual Tracker Benchmark; the examples on the right are results of the proposed method. (Color figure online)

Figure 6 shows examples of the proposed method dealing with appearance variation and occlusion. As shown in Fig. 6, although the sequences contain various target objects of the same category, the proposed method can track each target object without template or parameter modification. We also confirmed that the proposed method can track an occluded target object more stably than the top five traditional trackers, because DP seeks the globally optimal path over all frames. However, when target objects of the same category appear under occlusion, as in the final sequence of Fig. 6, the proposed method does not know which object it should track. Since the proposed method relies on general features of the category during tracking, specific features of the target object would need to be added to track it distinguishably.

We also observed the performance of FCNT [29] on the same sequences. It achieved 0.951 and 0.945 precision for single-object and multi-object sequences, respectively, at a threshold of 20 pixels. It is, however, hardly meaningful to compare this precision to ours. First of all, FCNT needs a template and ours does not. In addition, FCNT needs initialization and ours does not. In fact, the comparison between Figs. 3 and 4 shows that the proposed method suffers no severe degradation without initialization. Furthermore, our DP-based method has theoretical superiority over FCNT in robustness to occlusion, as shown in Fig. 7.

Fig. 7. Example results of FCNT and the proposed method when tracking an occluded target object.

5 Conclusion

In this paper, we presented an object tracking method which combines SSD, FCN, and DP. Through several experiments, we confirmed that the proposed method is robust to appearance variation and occlusion and achieves the highest accuracy compared to the traditional trackers in the Visual Tracker Benchmark. In contrast to traditional trackers, the proposed method can track a target object without initialization, parameter modification, or templates, owing to the synergy of combining SSD, FCN, and DP. Moreover, the proposed method can be extended to tracking among multiple similar objects by using initialization. We also confirmed that using both SSD and FCN yields more stable tracking than single SSD or single FCN.

We expect the proposed method to be useful in analysis fields such as traffic analysis and bio-image analysis. In the future, we will connect SSD and FCN with network flows [37] to track multiple target objects simultaneously. To increase the tracking accuracy in situations where similar objects of the same category appear with occlusion, such as the last sequence in Fig. 6, we will apply FlowNet [38] in order to utilize optical flow information.