Visual tracking using Siamese convolutional neural network with region proposal and domain specific updating

doi:10.1016/j.neucom.2017.11.050

Neurocomputing

Volume 275, 31 January 2018, Pages 2645-2655

https://doi.org/10.1016/j.neucom.2017.11.050

Abstract

This paper addresses arbitrary object tracking with a Siamese convolutional neural network (CNN), which is trained to match the initial patch of the target in the first frame with candidates in a new frame. The network returns the most similar candidate, i.e., the one with the smallest margin contrastive loss. To propose candidates in each frame, a Siamese region proposal network identifies potential targets across the whole frame. It can also mine hard negative examples that make the network more discriminative for the specific sequence. The Siamese tracking network and the Siamese region proposal network share weights, which are trained end-to-end. Thanks to the fast implementation of the fully convolutional architecture, the Siamese region proposal network adds little extra time during online tracking. Although the network is trained as a generic tracker that can be applied to any video sequence, we find that domain specific network updating with a short- and long-term strategy significantly improves tracking performance. By combining generic Siamese network training, Siamese region proposal, and domain specific updating, the proposed tracker obtains state-of-the-art tracking performance.

Introduction

Thanks to deep convolutional neural networks (CNNs) and huge training datasets, computer vision has advanced considerably [1]. In image classification [2] and target detection [3], [4] in particular, computers have reached human-level performance. However, generic visual object tracking remains challenging, even with the help of deep CNNs and large training datasets, because the object is unknown before tracking and can take arbitrary size and appearance over the whole tracking interval. Most existing trackers, whether based on a generative or a discriminative model, such as SCM [5], MILTrack [6], Struck [7], TLD [8] and KCF [9], follow an object-specific approach: the model of the object's appearance is learned online from examples extracted from the video itself. However, the labels of these online training examples are assigned by the tracking results themselves, so they are not guaranteed to be correct. The reliability of those labels depends heavily on the validity of the tracking model.

Although trackers based on deep CNNs are not yet satisfactory, they still outperform traditional trackers with shallow architectures. Several deep architectures have been developed to take full advantage of the representation power of CNNs in visual tracking. Training a proper CNN in an object-specific manner is difficult. Approaches have been proposed that transfer CNN weights pre-trained on a large-scale image dataset such as ImageNet [10], [11], [12]. Because there is only one reference image (the ground truth) for tracking, it makes sense to compare the candidate image patches with the reference image patch and then choose the most similar one. A Siamese CNN meets this demand exactly. However, it is difficult for a Siamese network to learn features that are representative enough to cope with the various appearance changes of the same target, such as changes in geometry/photometry, camera viewpoint, illumination, or partial occlusion. Building a reliable generic visual tracker for arbitrary objects is still a long way off.
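The matching idea in the paragraph above (compare each candidate patch with the single reference patch and keep the most similar one) can be sketched as follows. This is a minimal illustration only: it assumes the two Siamese branches have already mapped every patch to a feature vector and that similarity is measured by Euclidean distance. The function names `euclidean` and `best_candidate` and the toy feature values are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_candidate(reference: Sequence[float],
                   candidates: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the candidate closest to the reference."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    distances = [euclidean(reference, c) for c in candidates]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]


# Toy features (assumed values): the second candidate lies nearest
# to the reference feature extracted from the first frame.
reference = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
index, distance = best_candidate(reference, candidates)
```

In this sketch the reference feature stays fixed, which mirrors the single ground-truth patch available in the first frame; any appearance change of the target must be absorbed by the learned features.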

In this paper, we propose a Siamese CNN tracker based on the VGG-M network [13]. A similar CNN architecture has also been employed by the trackers in [12], [14], which offer the best performance to date in accuracy or efficiency. Our network is initialized with weights pre-trained on the ImageNet classification dataset and further fine-tuned using three video datasets [15], [16], [17]. The entire network is trained end-to-end by SGD with back-propagation using the Caffe library [18]. The Siamese CNN is used not only for tracking but also for candidate object proposal, by embedding an additional convolutional layer. The proposed Siamese CNN tracker can be used as a generic tracker that can be applied directly to any video sequence. Inspired by the outstanding performance of MDNet [12], we find that domain specific updating can improve tracking performance significantly. Experimental results show that our network gives the best performance among previously proposed Siamese networks [10], [11], [19].

The key contributions of our work are twofold. First, a Siamese CNN tracker is presented and trained on a combined dataset containing 1719 sequences. Without any model updating, the tracker obtains performance comparable to MEEM [20]. We also train our Siamese network on a small dataset containing only 58 sequences. A comparative experiment on the OTB-100 dataset shows that large and varied training data is important for obtaining a good generic CNN tracker. However, the performance of the CNN-based tracker is still not as good as expected. We assume this is because the deep features are not representative enough for all object variations, such as deformation and occlusion. This means that domain specific model updating is essential to further improve the performance of any tracker. By fine-tuning the last three layers of our network during online tracking, a significant performance improvement is observed. During online updating, a short- and long-term memory strategy is adopted.

Second, a Siamese region proposal network is constructed on top of the proposed Siamese CNN tracker. The region proposal network identifies potential object candidates across the whole incoming frame instead of within a small search radius around the previous target location. It brings three main improvements. First, it reduces the number of candidate image patches in each frame, which improves the efficiency of tracking. Second, it re-detects the tracked object after it has been lost or occluded for a while. Third, the region proposal procedure helps to mine hard negative examples that are used for model updating. During tracking, we update the object model by concentrating on hard false positives supplied by the region proposal network. Hard false-positive samples help to suppress distractors caused by complex background clutter and to learn how to re-rank proposals according to the object model. The proposed Siamese region proposal network takes advantage of the fast implementation of the fully convolutional network proposed in [14], where an additional correlation layer is appended at the end. Because the Siamese tracker network and the Siamese region proposal network share parameters, proposing candidate target regions in a new frame costs only a little extra time. The Siamese CNN tracker and the Siamese region proposal network are combined in a way similar to the combination of the objectness region proposal network and the detection network in the Faster R-CNN detector [21].
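To make the role of the correlation layer more concrete, the sketch below slides a small template feature map over a larger search feature map and records the inner product at every position, which yields a response map whose peaks mark likely target locations. It is a simplified illustration under stated assumptions: the maps are single-channel, the correlation is unnormalized, and the names `cross_correlate` and `peak` and all values are hypothetical. The networks discussed here operate on multi-channel deep features, and this sketch is not the paper's implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def cross_correlate(search: Matrix, template: Matrix) -> Matrix:
    """Slide the template over the search map (valid positions only)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    if th > sh or tw > sw:
        raise ValueError("template must fit inside the search map")
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the current window.
            score = sum(search[i + u][j + v] * template[u][v]
                        for u in range(th) for v in range(tw))
            row.append(score)
        response.append(row)
    return response


def peak(response: Matrix) -> Tuple[int, int]:
    """Return the (row, column) of the largest response value."""
    best = max((value, i, j)
               for i, row in enumerate(response)
               for j, value in enumerate(row))
    return best[1], best[2]


# Toy example: the template pattern is embedded in the search map
# with its top-left corner at row 1, column 2.
template = [[1.0, 2.0],
            [3.0, 4.0]]
search = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 2.0],
          [0.0, 0.0, 3.0, 4.0],
          [0.0, 0.0, 0.0, 0.0]]
response = cross_correlate(search, template)
location = peak(response)
```

Because the whole search map is scored in one pass, the cost of producing candidates grows with the map size rather than with the number of separately cropped patches, which is the efficiency argument made above.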

The paper is organized as follows. Section 2 discusses related work in tracking and convolutional neural networks. Our tracking framework is described in Section 3. Section 4 presents the experimental evaluations and results. Finally, conclusions are given in Section 5.

Section snippets

Siamese tracker

Object representation is one of the major components of any visual tracking algorithm. Wang [22] concludes that the feature extractor is the most important part of a tracker and that the observation model matters little if the features are good enough. Fortunately, a deep CNN is a powerful tool for learning good visual features. Given the initial state (e.g., position and extent) of a target object in the first image, the goal of tracking is to estimate the states of the target in the

The proposed Siamese CNN tracker

Inspired by the previously mentioned trackers, especially those in [12], [14], and [26], we present a new Siamese CNN tracker that is trained with a margin contrastive loss. First, the Siamese CNN illustrated in Fig. 1 is trained with different training sequences, as presented in Section 3.1. A generic tracker is obtained after offline network training. The generic tracker is used to demonstrate the effectiveness of our proposed Siamese structure as illustrated in Fig. 1. The way of how the
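The margin contrastive loss named in this section can be illustrated with a short sketch. The exact formulation used in the paper is not shown in this excerpt, so the code assumes one widely used form: matching pairs are penalized by half their squared feature distance, and non-matching pairs are penalized only when their distance falls inside the margin. The function name `contrastive_loss`, the default margin of 1.0, and the toy features are assumptions made for illustration.

```python
import math
from typing import Sequence


def contrastive_loss(f1: Sequence[float], f2: Sequence[float],
                     same_target: bool, margin: float = 1.0) -> float:
    """Margin contrastive loss for one pair of feature vectors.

    Assumed form (may differ from the paper's):
      same target:      0.5 * d^2
      different target: 0.5 * max(0, margin - d)^2
    where d is the Euclidean distance between the two features.
    """
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if same_target:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


anchor = [0.0, 0.0]
near = [0.3, 0.4]  # Euclidean distance 0.5 from the anchor
far = [3.0, 4.0]   # Euclidean distance 5.0 from the anchor

loss_positive = contrastive_loss(anchor, near, same_target=True)
loss_hard_negative = contrastive_loss(anchor, near, same_target=False)
loss_easy_negative = contrastive_loss(anchor, far, same_target=False)
```

Under this assumed form, a negative pair that already lies outside the margin contributes no loss, so training effort concentrates on negatives that the network still confuses with the target.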

Experiments

The proposed tracker, termed SRPT (Siamese Region Proposal Tracker), is evaluated on the large benchmark dataset OTB-100 [31], which contains 100 videos, and compared with state-of-the-art methods. Trackers are evaluated following the protocol in [31] using the success plot, which measures the percentage of successfully tracked frames. A frame is counted as successfully tracked if the IoU between the predicted bounding box and the ground truth box is greater than a threshold. The success plot is
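The success criterion stated in this section can be sketched in a few lines: compute the IoU between each predicted box and its ground truth box, then report the fraction of frames whose IoU exceeds a threshold. The box layout `(x, y, width, height)`, the function names `iou` and `success_rate`, and the toy boxes are assumptions made for illustration and are not taken from the benchmark toolkit.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(predicted: Sequence[Box], ground_truth: Sequence[Box],
                 threshold: float) -> float:
    """Fraction of frames whose IoU is strictly greater than the threshold."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("need one prediction per ground truth frame")
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if iou(p, g) > threshold)
    return hits / len(ground_truth)


# Four toy frames with the same ground truth box.
ground_truth = [(0.0, 0.0, 10.0, 10.0)] * 4
predicted = [
    (0.0, 0.0, 10.0, 10.0),    # perfect overlap, IoU = 1
    (5.0, 0.0, 10.0, 10.0),    # half shifted, IoU = 50 / 150
    (20.0, 20.0, 10.0, 10.0),  # no overlap, IoU = 0
    (0.0, 0.0, 10.0, 5.0),     # half height, IoU = 50 / 100
]
rate_strict = success_rate(predicted, ground_truth, threshold=0.5)
rate_loose = success_rate(predicted, ground_truth, threshold=0.3)
```

Evaluating `success_rate` over a range of thresholds gives one point of a success curve per threshold; how the benchmark summarizes that curve is defined by the protocol in [31].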

Conclusion

We propose a visual tracker using a Siamese CNN combined with a Siamese region proposal network and domain specific updating. The region proposal network is able to identify potential targets across the whole frame and to mine hard negative examples that make the network more discriminative for the specific sequence. It shares weights with the tracking network and thus adds little extra time for region proposal. Domain specific fine-tuning and short- and long-term based online

Han Zhang received the B.S. degree in electronics science and technology from Shanghai Jiao Tong University, Shanghai, China, in 2010 and the M.S. degree from the National University of Defense Technology, Changsha, China, in 2012. She is currently a Research Associate at the Northwest Institute of Nuclear Technology, Xi'an, China. Her research interests include multitemporal remote sensing, image analysis and pattern recognition.

References (34)

Y. Guo et al.
Deep learning for visual understanding: a review
Neurocomputing
(2016)

K. Chen et al.
Once for all: a two-flow convolutional neural network for visual tracking
IEEE Transactions on Circuits & Systems for Video Technology
(2016)

J. Wang et al.
Object tracking using color-feature guided network generalization and tailored feature fusion
Neurocomputing
(2017)

K. He et al.
Deep residual learning for image recognition

J. Redmon et al.
You only look once: unified, real-time object detection

W. Liu et al.
SSD: single shot MultiBox detector

W. Zhong et al.
Robust object tracking via sparse collaborative appearance model
IEEE Trans. Image Process.
(2014)

B. Babenko et al.
Visual tracking with online multiple instance learning

S. Hare et al.
Struck: structured output tracking with kernels

Z. Kalal et al.
Tracking-learning-detection
PAMI
(2012)

J.F. Henriques et al.
High-speed tracking with kernelized correlation filters
PAMI
(2015)

R. Tao et al.
Siamese instance search for tracking

H. Nam et al.
Learning multi-domain convolutional neural networks for visual tracking

K. Chatfield et al.
Return of the devil in the details: delving deep into convolutional nets

L. Bertinetto et al.
Fully-convolutional Siamese networks for object tracking

M. Kristan et al.
The visual object tracking VOT2015 challenge results

O. Russakovsky et al.
ImageNet large scale visual recognition challenge


Weiping Ni was born in China in 1980. He received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2004, the M.S. degree from the National University of Defense Technology, Changsha, China, in 2006, and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2016. Since 2014, he has been a Research Associate with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image processing, automatic target recognition, and computer vision.

Weidong Yan was born in 1967. He received the B.S. and M.S. degrees in electronic engineering from the School of Electrical Engineering, National University of Defense Technology, Changsha, China. He is currently a Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image analysis and pattern recognition.

Junzheng Wu received the B.S. degree in automation from Tsinghua University, China, in 2008 and the M.S. degree in signal and information processing from the Northwest Institute of Nuclear Technology (NINT) in 2011. He is currently a Research Assistant at NINT, and his research interests include computer vision and remote sensing image processing.

Hui Bian was born in 1971. He is an Associate Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include image fusion, target detection and pattern recognition.

Deliang Xiang received the B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2010 and the M.S. degree from National University of Defense Technology, Changsha, China, in 2012. He is currently pursuing the Ph.D. degree in microwave remote sensing at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include urban area remote sensing, PolSAR image processing, and pattern recognition.
