Pattern Recognition

Volume 83, November 2018, Pages 185-195

Visual tracking using spatio-temporally nonlocally regularized correlation filter

https://doi.org/10.1016/j.patcog.2018.05.017

Highlights

  • A novel regularized CF-based tracking approach is proposed, with promising results on three benchmark datasets.

  • Our method effectively captures long-term spatio-temporally nonlocal superpixel appearance information to regularize the CF learning.

  • Our method deals well with challenging factors such as large viewpoint changes and non-rigid deformation.

Abstract

Owing to factors such as fast motion, cluttered backgrounds, arbitrary object appearance variation, and shape deformation, an effective target representation plays a key role in robust visual tracking. Existing methods often employ bounding boxes for target representation, which are easily polluted by background clutter and may cause drift when the target undergoes large-scale non-rigid or articulated motion. To address this issue, motivated by the spatio-temporal nonlocality of target appearance reoccurrence in a video, we explore nonlocal information to accurately represent and segment the target, yielding an object likelihood map to regularize a correlation filter (CF) for visual tracking. Specifically, given a set of tracked target bounding boxes, we first generate a set of superpixels to represent the foreground and background, and then update the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. With the updated appearances, we formulate a spatio-temporal graphical model composed of superpixel label-consistency potentials. We then generate the segmentation by optimizing the graphical model, iteratively updating the appearance model and estimating the labels. Finally, from the segmentation mask we obtain an object likelihood map that adaptively regularizes the CF learning, suppressing cluttered background noise while making full use of long-term stable target appearance information. Extensive evaluations on the OTB50, SegTrack, and YouTube-Objects datasets demonstrate the effectiveness of the proposed method, which performs favorably against state-of-the-art methods.

Introduction

Visual tracking is a classic computer vision problem, but it remains a challenging task since the target appearance may suffer from factors such as fast motion, cluttered backgrounds, arbitrary object appearance variation, and shape deformation, to name a few. A robust appearance model plays a key role in ensuring good tracking performance and has thus attracted much attention over the past decades [1], [2], [3], [4], [5], [6]. Here, we also investigate how to learn an effective appearance model for robust visual tracking.

Recently, discriminative correlation filter (DCF) based approaches have achieved great success in visual tracking [7], [8], [9], [10], [11], [12], [13], [14], [15]. These methods learn a CF from a set of circulant training samples, corresponding to periodic extensions of these samples, so that both training and detection can be performed efficiently with the fast Fourier transform (FFT). However, the circulant structure of the training samples also introduces unwanted boundary effects, which yield an inaccurate target representation and thereby limit the discriminative capability of the learned CF. These boundary effects have been mitigated by recent work [9], [12]: in [9], Danelljan et al. introduce a spatial regularization that suppresses CF values outside the object boundary, while in [12] the CF is learned with a set of real negative samples densely sampled from the background. A second drawback of CF-based trackers is that the target shape is represented by a bounding box, which may introduce background clutter into the target representation, especially when the target undergoes large-scale non-rigid deformation, yielding suboptimal positive samples. Over time this degrades the learned CF model and causes drift. To address this issue, Lukežič et al. [7] employ a segmentation mask that separates the target from the background well to spatially regularize the CF learning. However, only local spatial information is exploited to segment the object, without resorting to the long-term spatio-temporal information that helps achieve a robust segmentation.
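To make the circulant-sample formulation concrete, the following is a minimal single-channel, linear-kernel sketch of Fourier-domain CF training and detection (a generic MOSSE/KCF-style baseline, not this paper's tracker); the function names and the ridge weight lam are hypothetical:

    import numpy as np

    def train_cf(x, y, lam=1e-2):
        # Ridge regression over all cyclic shifts of patch x diagonalizes
        # in the Fourier domain, so training is a per-frequency division.
        # x: training patch (float 2-D array), y: desired response map
        # (e.g., a Gaussian peaked at the target center).
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        return (np.conj(X) * Y) / (np.conj(X) * X + lam)

    def detect(H, z):
        # Response map on a search patch z; its peak gives the new
        # target position.
        return np.real(np.fft.ifft2(H * np.fft.fft2(z)))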

To address these problems, we propose a spatio-temporally nonlocally (STN) regularized CF-based tracker that explores the long-term spatio-temporal nonlocal information of the target appearance to accurately represent and segment the target from the background. Fig. 1 shows the flow chart of our method, in which the segmentation method is motivated by video object segmentation techniques [16], [17], [18]. First, given the tracked bounding boxes in a set of consecutive frames, we leverage optical flow to obtain a rough object position that ensures frame-to-frame segmentation consistency. Specifically, we produce a rough motion boundary between pairs of adjacent frames using the technique in [16] and then obtain an efficient initial foreground estimate. Here, the only requirement on the object is that it move differently from its surrounding background in some frames of the video. Moreover, to reduce the noise introduced by appearance learning, we explore superpixel information from long-term spatio-temporal nonlocal regions to learn a robust appearance model, which is integrated into a spatio-temporal graph model. Afterwards, as in GrabCut [19], the graph model is solved iteratively by refining the foreground-background labeling and updating the foreground-background appearance models. Finally, having obtained the segmentation mask, we generate an object likelihood map as a weighting factor to adaptively regularize the CF learning for visual tracking. To solve the regularized CFs, we employ the alternating direction method of multipliers (ADMM) [7], in which each iteration can be solved efficiently via the FFT. We extensively evaluate the proposed algorithm on three challenging datasets, OTB50 [20], SegTrack [21], and YouTube-Objects [22], showing favorable results against state-of-the-art methods.
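As a rough illustration of how a likelihood map can regularize CF learning via ADMM, below is a minimal single-channel sketch under a simplified objective (squared response error plus a ridge penalty, with the constraint that the filter equals its mask-weighted copy). This is an assumption-laden sketch, not the authors' multi-channel formulation; x, y, m, lam, and mu denote a training patch, the desired response, the object likelihood map in [0, 1], and two hypothetical penalty weights:

    import numpy as np

    def learn_masked_cf(x, y, m, lam=1e-2, mu=1.0, iters=4):
        # x, y, m: float 2-D arrays of the same shape.
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        g = np.zeros_like(x)   # masked spatial-domain filter copy
        u = np.zeros_like(x)   # scaled dual variable for h = m * g
        for _ in range(iters):
            # Filter step: per-frequency ridge solve in the Fourier domain.
            Q = np.fft.fft2(m * g - u)
            H = (np.conj(X) * Y + mu * Q) / (np.conj(X) * X + mu)
            h = np.real(np.fft.ifft2(H))
            # Mask step: closed-form elementwise update that shrinks
            # filter energy where the object likelihood is low.
            g = mu * m * (h + u) / (lam + mu * m ** 2)
            # Dual ascent on the constraint residual.
            u = u + h - m * g
        return m * g   # effective filter, suppressed outside the object

Each iteration needs only FFTs and elementwise operations, which is what makes the ADMM solver efficient.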

The main contributions of this work are summarized as follows:

  • We present a robust appearance model for segmentation that exploits spatio-temporally nonlocal information from superpixels.

  • We present a segmentation-based regularized KCF tracker that outputs not only an object rectangle but also an accurate contour, achieving more accurate results than the KCF tracker.

  • The proposed approach achieves favorable results on both video segmentation and visual tracking benchmarks, namely SegTrack, YouTube-Objects, and OTB50.

Section snippets

Related work

In this section, we briefly review some recent approaches related to our work, including video object segmentation and CF based visual tracking.

Methodology

In contrast to supervised video object segmentation methods [17], [30], which require a well-segmented target in the first frame, the proposed tracking method needs only an initial bounding box in the first frame as a prior. Before assigning each pixel a label, in order to reduce the computational complexity and the background noise, we first use the method introduced in Section 3.1 to obtain a coarse object location mask in each frame. Then, we use the TurboPixel algorithm [46] to …
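For illustration, here is a minimal sketch of the oversegmentation step restricted to the coarse location mask, using scikit-image's SLIC (the superpixel generator reported in the experimental setup; the function and its coarse_mask argument are hypothetical):

    import numpy as np
    from skimage.segmentation import slic

    def object_superpixels(frame, coarse_mask, n_segments=300):
        # Oversegment the frame, then keep only the superpixels that
        # touch the coarse object-location mask (a boolean array).
        labels = slic(frame, n_segments=n_segments, compactness=10.0,
                      start_label=1)
        keep = np.unique(labels[coarse_mask])
        return labels, keep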

Setup

All images are resized to a fixed 240 × 320 pixels for the experiments, which helps handle fast motion effectively. We employ SLIC [50] to generate a set of superpixels in each frame owing to its high efficiency. For each sequence, the number of superpixels ranges from about 50 to 1500. The parameter β in (9) is set to β = 0.5, the parameter λ in (12) to λ = 1, and the parameter η in (16) and (17) to η = 0.075. We use d = 31-channel HOG features to represent the target appearance …
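For reference, the reported settings can be gathered into a single configuration sketch (a hypothetical dict; the equation numbers refer to the full paper):

    # Reported experimental settings in one place; (9), (12), (16)-(17)
    # are equation numbers in the full paper.
    SETUP = {
        "frame_size": (240, 320),             # all images resized to (H, W)
        "superpixels_per_frame": (50, 1500),  # approximate per-sequence range
        "beta": 0.5,                          # Eq. (9)
        "lambda": 1.0,                        # Eq. (12)
        "eta": 0.075,                         # Eqs. (16) and (17)
        "hog_channels": 31,                   # d-channel HOG features
    }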

Conclusion

In this paper, we have presented a CF-based tracking approach that explores spatio-temporal nonlocal information to accurately represent and segment the target in a video, yielding an object likelihood map to regularize the CF learning. Specifically, given the tracked target bounding boxes, we first generated a set of superpixels to represent the foreground and background, and then updated the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. Then, …

Acknowledgements

This work is supported in part by the NSF of Jiangsu Province under Grants BK20170040, BK20151529, and BK20150906; in part by the Six Talent Peaks Project in Jiangsu Province under Grant R2017L07; in part by the NSFC under Grant 61773002; and in part by the Applied Basic Research Project in Shanxi Province under Grant 201601D011007.

References (61)

  • X.-F. Wang et al., A novel level set method for image segmentation by incorporating local statistical analysis and global similarity measurement, Pattern Recognit. (2015).
  • P. Ochs et al., Segmentation of moving objects by long term video analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2014).
  • A. Lukežič et al., Discriminative correlation filter with channel and spatial reliability, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Mueller et al., Context-aware correlation filter tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Danelljan et al., Learning spatially regularized correlation filters for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • L. Bertinetto et al., Staple: complementary learners for real-time tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • H. Kiani Galoogahi et al., Learning background-aware correlation filters for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Danelljan et al., Beyond correlation filters: learning continuous convolution operators for visual tracking, in: Proceedings of the European Conference on Computer Vision, 2016.
  • M. Danelljan et al., ECO: efficient convolution operators for tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • H. Kiani Galoogahi et al., Correlation filters with limited boundaries, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • A. Papazoglou et al., Fast object segmentation in unconstrained video, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.
  • N. Märki et al., Bilateral space video segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • T. Wang et al., Probabilistic motion diffusion of labeling priors for coherent video segmentation, IEEE Trans. Multimedia (2012).
  • C. Rother et al., GrabCut: interactive foreground extraction using iterated graph cuts, ACM Trans. Graph. (2004).
  • Y. Wu et al., Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • D. Tsai et al., Motion coherent tracking using multi-label MRF optimization, Int. J. Comput. Vis. (2012).
  • A. Prest et al., Learning object class detectors from weakly annotated video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • B.L. Price et al., LiveCut: learning-based interactive video segmentation by evaluation of multiple propagated cues, in: Proceedings of the IEEE International Conference on Computer Vision, 2009.
  • X. Bai et al., Video SnapCut: robust video object cutout using localized classifiers, ACM Trans. Graph. (2009).