Elsevier

Neurocomputing

Volume 275, 31 January 2018, Pages 2645-2655

Visual tracking using Siamese convolutional neural network with region proposal and domain specific updating

https://doi.org/10.1016/j.neucom.2017.11.050

Abstract

This paper addresses arbitrary object tracking with a Siamese convolutional neural network (CNN) that is trained to match the initial target patch from the first frame against candidates in a new frame. The network returns the most similar candidate, i.e., the one with the smallest margin contrastive loss. To propose candidates in each frame, a Siamese region proposal network identifies potential targets across the whole frame. It also mines hard negative examples, which makes the network more discriminative for the specific sequence. The Siamese tracking network and the Siamese region proposal network share weights and are trained end-to-end. Because the fully convolutional architecture can be implemented efficiently, the Siamese region proposal network adds little extra time during online tracking. Although the network is trained as a generic tracker that can be applied to any video sequence, we find that domain specific network updating with a short- and long-term strategy significantly improves tracking performance. Combining generic Siamese network training, Siamese region proposal, and domain specific updating, the proposed tracker obtains state-of-the-art tracking performance.

Introduction

Driven by deep convolutional neural networks (CNNs) and huge training datasets, computer vision has taken a great step forward [1]. In image classification [2] and target detection [3], [4] in particular, computers have reached human-level performance. However, generic visual object tracking remains challenging, even with the help of deep CNNs and large training datasets. The reason is that the object is unknown before tracking and can take arbitrary size and appearance over the whole tracking interval. Most existing trackers, whether built on a generative or a discriminative model, such as SCM [5], MILTrack [6], Struck [7], TLD [8] and KCF [9], follow an object-specific approach: the model of the object's appearance is learned online from examples extracted from the video itself. However, the labels of these online training examples are assigned by the online tracking results, so they are not guaranteed to be the ground truth. The reliability of those labels depends heavily on the validity of the tracking model.

Although trackers using deep CNNs are not yet satisfactory, they still outperform traditional trackers with shallow architectures. To take full advantage of the representation power of CNNs in visual tracking, several deep architectures have been developed. Training a proper CNN in an object-specific way is difficult, so approaches have been proposed that transfer CNN weights pre-trained on a large scale image dataset such as ImageNet [10], [11], [12]. Because there is only one reference image (ground truth) for tracking, it makes sense to compare the candidate image patches with the reference image patch and then choose the most similar one. A Siamese CNN meets this requirement exactly. However, it is difficult for a Siamese network to learn features that are representative enough to adapt to the various appearance changes of the same target, such as changes in geometry/photometry, camera viewpoint, illumination, or partial occlusion. There is still a long way to go before a reliable generic visual tracker for arbitrary objects is available.

In this paper, we propose a Siamese CNN tracker based on the VGG-M network [13]. A similar CNN architecture has also been employed by the trackers in [12,14], which offer the best accuracy or efficiency reported so far. Our proposed network is initialized with weights pre-trained on the ImageNet classification dataset and further fine-tuned using three video datasets [15], [16], [17]. The entire network is trained end-to-end by SGD with back-propagation using the Caffe library [18]. The Siamese CNN is used not only for tracking but also for candidate object proposals, by embedding an additional convolutional layer. The proposed Siamese CNN tracker can be used as a generic tracker that can be directly applied to any video sequence. Inspired by the outstanding performance of MDNet [12], we find that domain specific updating can improve the tracking performance significantly. Experimental results show that our proposed network gives the best performance among previously proposed Siamese networks [10,11,19].

The key contributions of our work lie in two aspects. First, a Siamese CNN tracker is presented and trained using a combined dataset containing 1719 sequences. Without any model updating, the tracker achieves performance comparable to MEEM [20]. We also train our Siamese net with a small dataset containing only 58 sequences. A comparative experiment on the OTB-100 dataset shows that large and varied training data is important for obtaining a good generic CNN tracker. However, the performance of the CNN based tracker is still not as good as expected. We assume this is because the deep features are not representative enough for all object variations, such as deformation and occlusion. This means that domain specific model updating is essential to further improve the performance of any tracker. By fine-tuning the last three layers of our network during online tracking, a significant performance improvement is observed. During online updating, a short- and long-term memory strategy is adopted.

The second contribution is a Siamese region proposal network built on the proposed Siamese CNN tracker. The region proposal network identifies potential object candidates across the whole incoming frame instead of within a small search radius around the previous target location. The improvement brought by the region proposal network lies mainly in three aspects. First, it reduces the number of candidate image patches in each frame, which improves the efficiency of the tracking procedure. Second, it re-detects the tracked object after it is lost or has been occluded for a while. Third, the region proposal procedure helps to mine hard negative examples that are used for model updating. During the tracking process, we update the object model concentrating on hard false-positives supplied by the region proposal network. Hard false-positive samples help to suppress distractors caused by complex background clutter and to learn how to re-rank proposals according to the object model. The proposed Siamese region proposal network takes advantage of the fast implementation of the fully convolutional network proposed in [14], where an additional correlation layer is appended at the end. Because the Siamese tracker network and the Siamese region proposal network share parameters, only a little extra time is needed for candidate target region proposal in a new frame. The Siamese CNN tracker and the Siamese region proposal network are combined in a way similar to the combination of the target objectness region proposal network and the target detection network in the Faster R-CNN target detector [21].
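The correlation layer mentioned above can be illustrated with a toy example. The following is a minimal sketch, not the authors' network: it assumes single-channel feature maps stored as nested lists, whereas a real tracker would correlate multi-channel learned features. The function names `cross_correlate` and `top_k_peaks` are hypothetical and introduced only for illustration.

```python
def cross_correlate(search, exemplar):
    """Slide `exemplar` over `search` (both 2-D lists of numbers) and
    return the response map of dot products at every valid offset."""
    sh, sw = len(search), len(search[0])
    eh, ew = len(exemplar), len(exemplar[0])
    response = []
    for i in range(sh - eh + 1):
        row = []
        for j in range(sw - ew + 1):
            score = sum(
                search[i + di][j + dj] * exemplar[di][dj]
                for di in range(eh)
                for dj in range(ew)
            )
            row.append(score)
        response.append(row)
    return response


def top_k_peaks(response, k):
    """Return the k (score, row, col) cells with the highest response.
    These locations stand in for candidate target proposals."""
    cells = [
        (score, i, j)
        for i, row in enumerate(response)
        for j, score in enumerate(row)
    ]
    return sorted(cells, key=lambda c: c[0], reverse=True)[:k]


if __name__ == "__main__":
    # Toy "features": the exemplar pattern appears once in the search map.
    exemplar = [[1, 0], [0, 1]]
    search = [
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    response = cross_correlate(search, exemplar)
    print(top_k_peaks(response, k=1))
```

In this sketch the strongest peak marks the most target-like location, while other high-scoring peaks away from the target would play the role of hard negative candidates.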

The paper is organized as follows. Section 2 discusses related work in tracking and convolutional neural networks. Our tracking framework is described in Section 3. Section 4 presents the experimental evaluations and results. Finally, conclusions are given in Section 5.

Section snippets

Siamese tracker

Object representation is one of the major components in any visual tracking algorithm. Wang [22] concludes that the feature extractor is the most important part of a tracker and the observation model is not significantly important if the features are good enough. Fortunately, deep CNN is a powerful tool to learn good visual features. Given the initial state (e.g., position and extent) of a target object in the first image, the goal of tracking is to estimate the states of the target in the

The proposed Siamese CNN tracker

Inspired by the previously mentioned trackers, especially [12], [14], and [26], we present a new Siamese CNN tracker that is trained with a margin contrastive loss. First, the Siamese CNN illustrated in Fig. 1 is trained with different training sequences, as presented in Section 3.1. A generic tracker is obtained after offline network training. The generic tracker is used to demonstrate the effectiveness of our proposed Siamese structure as illustrated in Fig. 1. The way of how the
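The exact loss used by the authors is not shown in this excerpt. As a rough illustration, the sketch below assumes the commonly used margin contrastive loss on the Euclidean distance d between two feature vectors: 0.5 * d^2 for a matching pair and 0.5 * max(0, margin - d)^2 for a non-matching pair. The function names, the margin value and the list-based features are assumptions made for this example.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_contrastive_loss(feat_a, feat_b, is_same, margin=1.0):
    """Standard margin contrastive loss (assumed form, not taken from the paper).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until their distance exceeds `margin`.
    """
    d = euclidean_distance(feat_a, feat_b)
    if is_same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def most_similar(reference, candidates):
    """Index of the candidate whose features are closest to the reference,
    i.e. the one with the smallest loss when treated as a matching pair."""
    losses = [margin_contrastive_loss(reference, c, True) for c in candidates]
    return min(range(len(candidates)), key=lambda i: losses[i])


if __name__ == "__main__":
    reference = [0.0, 0.0]
    candidates = [[3.0, 4.0], [0.0, 0.5], [2.0, 2.0]]
    print(most_similar(reference, candidates))
```

At test time only the ranking matters, so picking the candidate with the smallest matching-pair loss is equivalent to picking the one with the smallest feature distance.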

Experiments

The proposed tracker, termed SRPT (Siamese Region Proposal Tracker), is evaluated on the large benchmark dataset OTB-100 [31], which contains 100 videos, and compared with state-of-the-art methods. Trackers are evaluated following the protocol in [31] using the success plot, which measures the percentage of successfully tracked frames. A frame is defined as successfully tracked if the IOU value between the predicted bounding box and the ground truth box is larger than a threshold. The success plot is
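The success criterion described above can be sketched in a few lines. This is an illustrative implementation, not the benchmark's official toolkit: boxes are assumed to be (x, y, width, height) tuples, and the 21 evenly spaced thresholds between 0 and 1 are an assumption chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(predicted, ground_truth, threshold):
    """Fraction of frames whose IOU is larger than `threshold`."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


def success_curve(predicted, ground_truth, steps=21):
    """Success rate at evenly spaced thresholds in [0, 1] (assumed spacing)."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return [(t, success_rate(predicted, ground_truth, t)) for t in thresholds]


if __name__ == "__main__":
    ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
    predicted = [(0, 0, 10, 10), (5, 0, 10, 10)]
    for threshold, rate in success_curve(predicted, ground_truth, steps=5):
        print(threshold, rate)
```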

Conclusion

We propose a visual tracker using a Siamese CNN combined with a Siamese region proposal network and domain specific updating. The region proposal network is able to identify potential targets across the whole frame and to mine hard negative examples that make the network more discriminative for the specific sequence. It shares weights with the tracking network and thus adds little extra time for region proposal. Domain specific fine-tuning and short- and long-term based online

Han Zhang received the B.S. degree in electronics science and technology from Shanghai Jiao Tong University, Shanghai, China, in 2010 and the M.S. degree from National University of Defense Technology, Changsha, China, in 2012. She is currently working as Research Associate in Northwest Institute of Nuclear Technology, Xi'an, China. Her research interests include multitemporal remote sensing, image analysis and pattern recognition.

References (34)

  • J.F. Henriques et al.
    High-speed tracking with Kernelized correlation filters
    PAMI (2015)

  • R. Tao et al.
    Siamese instance search for tracking

  • H. Nam et al.
    Learning multi-domain convolutional neural networks for visual tracking

  • K. Chatfield et al.
    Return of the devil in the details: delving deep into convolutional nets

  • L. Bertinetto et al.
    Fully-convolutional Siamese networks for object tracking

  • M. Kristan et al.
    The visual object tracking VOT2015 challenge results

  • O. Russakovsky et al.
    ImageNet large scale visual recognition challenge


    Weiping Ni was born in China in 1980. He received the B.S. degree from University of Science and Technology of China, Hefei, China, in 2004, the M.S. degree from National University of Defense Technology, Changsha, China, in 2006, and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2016. Since 2014, he has been a Research Associate with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image processing, automatic target recognition, and computer vision.

    Weidong Yan was born in 1967. He received the B.S. and M.S. degrees in electronic engineering from the School of Electrical Engineering, National University of Defense Technology, Changsha, China. He is currently a Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image analysis and pattern recognition.

    Junzheng Wu received the B.S. degree (2008) in automation from Tsinghua University, China, and the M.S. degree (2011) in signal and information processing from Northwest Institute of Nuclear Technology (NINT). He is currently a Research Assistant at NINT, and his research interests include computer vision and remote sensing image processing.

    Hui Bian was born in 1971. He is an Associate Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include image fusion, target detection and pattern recognition.

    Deliang Xiang received the B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2010 and the M.S. degree from National University of Defense Technology, Changsha, China, in 2012. He is currently pursuing the Ph.D. degree in microwave remote sensing at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include urban area remote sensing, PolSAR image processing, and pattern recognition.
