Pattern Recognition

Volume 83, November 2018, Pages 185-195

Visual tracking using spatio-temporally nonlocally regularized correlation filter

https://doi.org/10.1016/j.patcog.2018.05.017

Highlights

  • A novel regularized CF-based tracking approach is proposed, with promising results on three benchmark datasets.

  • Our method effectively captures long-term spatio-temporally nonlocal superpixel appearance information to regularize the CF learning.

  • Our method deals well with challenging factors such as large viewpoint changes and non-rigid deformation.

Abstract

Owing to factors such as fast motion, cluttered backgrounds, arbitrary object appearance variation, and shape deformation, an effective target representation plays a key role in robust visual tracking. Existing methods often employ bounding boxes for target representation, which are easily polluted by background clutter and may cause drift when the target undergoes large-scale non-rigid or articulated motion. To address this issue, motivated by the spatio-temporal nonlocality of target appearance reoccurrence in a video, we explore nonlocal information to accurately represent and segment the target, yielding an object likelihood map to regularize a correlation filter (CF) for visual tracking. Specifically, given a set of tracked target bounding boxes, we first generate a set of superpixels to represent the foreground and background, and then update the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. With the updated appearances, we formulate a spatio-temporal graphical model composed of superpixel label-consistency potentials. We then generate the segmentation by optimizing the graphical model, iteratively updating the appearance model and estimating the labels. Finally, from the segmentation mask we obtain an object likelihood map that adaptively regularizes the CF learning, suppressing cluttered background noise while making full use of long-term stable target appearance information. Extensive evaluations on the OTB50, SegTrack, and YouTube-Objects datasets demonstrate the effectiveness of the proposed method, which performs favorably against state-of-the-art methods.

Introduction

Visual tracking is a classic computer vision problem, but it remains a challenging task since the target appearance may suffer from factors such as fast motion, cluttered backgrounds, arbitrary object appearance variation, and shape deformation, to name a few. A robust appearance model plays a key role in ensuring good tracking performance and has thus attracted much attention over the past decades [1], [2], [3], [4], [5], [6]. Here, we also investigate how to learn an effective appearance model for robust visual tracking.

Recently, discriminative correlation filter (DCF) based approaches have achieved great success in visual tracking [7], [8], [9], [10], [11], [12], [13], [14], [15]. These methods learn a CF from a set of circulant training samples, corresponding to periodic extensions of these samples, so that both training and detection can be performed efficiently with the fast Fourier transform (FFT). However, the circulant structure of the training samples also introduces unwanted boundary effects, which yield an inaccurate target representation and thereby limit the discriminative capability of the learned CF. These boundary effects have been mitigated by recent work [9], [12]: in [9], Danelljan et al. introduce a spatial regularization that suppresses CF values outside the object boundary, while in [12] the CF is learned with a set of real negative samples densely sampled from the background. A second drawback of CF-based trackers is that the target shape is represented by a bounding box, which may introduce background clutter into the target representation, especially when the target undergoes large-scale non-rigid deformation, yielding suboptimal positive samples. Over time this degrades the learned CF model and causes drift. To address this issue, Lukežič et al. [7] employ a segmentation mask that separates the target from the background well to spatially regularize the CF learning. However, only local spatial information is exploited to segment the object, without resorting to the long-term spatio-temporal information that helps achieve a robust segmentation.
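To make the circulant-sample formulation concrete, the following is a minimal single-channel, linear-kernel sketch of Fourier-domain CF training and detection (a generic MOSSE/KCF-style baseline, not this paper's tracker); the function names and the ridge weight lam are hypothetical:

    import numpy as np

    def train_cf(x, y, lam=1e-2):
        # Ridge regression over all cyclic shifts of patch x diagonalizes
        # in the Fourier domain, so training is a per-frequency division.
        # x: training patch (float 2-D array), y: desired response map
        # (e.g., a Gaussian peaked at the target center).
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        return (np.conj(X) * Y) / (np.conj(X) * X + lam)

    def detect(H, z):
        # Response map on a search patch z; its peak gives the new
        # target position.
        return np.real(np.fft.ifft2(H * np.fft.fft2(z)))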

To address these problems, we propose a spatio-temporally nonlocally (STN) regularized CF-based tracker that explores the long-term spatio-temporal nonlocal information of the target appearance to accurately represent and segment the target from the background. Fig. 1 shows the flow chart of our method, in which the segmentation method is motivated by video object segmentation techniques [16], [17], [18]. First, given the tracked bounding boxes in a set of consecutive frames, we leverage optical flow to obtain a rough object position that ensures frame-to-frame segmentation consistency. Specifically, we produce a rough motion boundary between pairs of adjacent frames using the technique in [16] and then obtain an efficient initial foreground estimate. Here, the only requirement on the object is that it move differently from its surrounding background in some frames of the video. Moreover, to reduce the noise introduced by appearance learning, we explore superpixel information from long-term spatio-temporal nonlocal regions to learn a robust appearance model, which is integrated into a spatio-temporal graph model. Afterwards, as in GrabCut [19], the graph model is solved iteratively by refining the foreground-background labeling and updating the foreground-background appearance models. Finally, having obtained the segmentation mask, we generate an object likelihood map as a weighting factor to adaptively regularize the CF learning for visual tracking. To solve the regularized CFs, we employ the alternating direction method of multipliers (ADMM) [7], in which each iteration can be solved efficiently via the FFT. We extensively evaluate the proposed algorithm on three challenging datasets, OTB50 [20], SegTrack [21], and YouTube-Objects [22], showing favorable results against state-of-the-art methods.
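As a rough illustration of how a likelihood map can regularize CF learning via ADMM, below is a minimal single-channel sketch under a simplified objective (squared response error plus a ridge penalty, with the constraint that the filter equals its mask-weighted copy). This is an assumption-laden sketch, not the authors' multi-channel formulation; x, y, m, lam, and mu denote a training patch, the desired response, the object likelihood map in [0, 1], and two hypothetical penalty weights:

    import numpy as np

    def learn_masked_cf(x, y, m, lam=1e-2, mu=1.0, iters=4):
        # x, y, m: float 2-D arrays of the same shape.
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        g = np.zeros_like(x)   # masked spatial-domain filter copy
        u = np.zeros_like(x)   # scaled dual variable for h = m * g
        for _ in range(iters):
            # Filter step: per-frequency ridge solve in the Fourier domain.
            Q = np.fft.fft2(m * g - u)
            H = (np.conj(X) * Y + mu * Q) / (np.conj(X) * X + mu)
            h = np.real(np.fft.ifft2(H))
            # Mask step: closed-form elementwise update that shrinks
            # filter energy where the object likelihood is low.
            g = mu * m * (h + u) / (lam + mu * m ** 2)
            # Dual ascent on the constraint residual.
            u = u + h - m * g
        return m * g   # effective filter, suppressed outside the object

Each iteration needs only FFTs and elementwise operations, which is what makes the ADMM solver efficient.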

The main contributions of this work are summarized as follows:

  • We present a robust appearance model for segmentation that exploits spatio-temporally nonlocal information from superpixels.

  • We present a segmentation-based regularized KCF tracker that outputs not only an object rectangle but also an accurate contour, achieving more accurate results than the KCF tracker.

  • The proposed approach achieves favorable results on both video segmentation and visual tracking benchmarks, namely SegTrack, YouTube-Objects, and OTB50.

Section snippets

Related work

In this section, we briefly review some recent approaches related to our work, including video object segmentation and CF based visual tracking.

Methodology

In contrast to supervised video object segmentation methods [17], [30], which require a well-segmented target in the first frame, the proposed tracking method needs only an initial bounding box in the first frame as a prior. Before assigning each pixel a label, in order to reduce the computational complexity and the background noise, we first use the method introduced in Section 3.1 to obtain a coarse object location mask in each frame. Then, we use the TurboPixel algorithm [46] to …
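For illustration, here is a minimal sketch of the oversegmentation step restricted to the coarse location mask, using scikit-image's SLIC (the superpixel generator reported in the experimental setup; the function and its coarse_mask argument are hypothetical):

    import numpy as np
    from skimage.segmentation import slic

    def object_superpixels(frame, coarse_mask, n_segments=300):
        # Oversegment the frame, then keep only the superpixels that
        # touch the coarse object-location mask (a boolean array).
        labels = slic(frame, n_segments=n_segments, compactness=10.0,
                      start_label=1)
        keep = np.unique(labels[coarse_mask])
        return labels, keep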

Setup

All images are resized to a fixed 240 × 320 pixels for the experiments, which helps handle fast motion effectively. We employ SLIC [50] to generate a set of superpixels in each frame owing to its high efficiency. For each sequence, the number of superpixels ranges from about 50 to 1500. The parameter β in (9) is set to β = 0.5, the parameter λ in (12) to λ = 1, and the parameter η in (16) and (17) to η = 0.075. We use d = 31-channel HOG features to represent the target appearance …
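For reference, the reported settings can be gathered into a single configuration sketch (a hypothetical dict; the equation numbers refer to the full paper):

    # Reported experimental settings in one place; (9), (12), (16)-(17)
    # are equation numbers in the full paper.
    SETUP = {
        "frame_size": (240, 320),             # all images resized to (H, W)
        "superpixels_per_frame": (50, 1500),  # approximate per-sequence range
        "beta": 0.5,                          # Eq. (9)
        "lambda": 1.0,                        # Eq. (12)
        "eta": 0.075,                         # Eqs. (16) and (17)
        "hog_channels": 31,                   # d-channel HOG features
    }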

Conclusion

In this paper, we have presented a CF-based tracking approach that explores spatio-temporal nonlocal information to accurately represent and segment the target in a video, yielding an object likelihood map to regularize the CF learning. Specifically, given the tracked target bounding boxes, we first generated a set of superpixels to represent the foreground and background, and then updated the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. Then, …

Acknowledgements

This work is supported in part by the NSF of Jiangsu Province under Grants BK20170040, BK20151529, and BK20150906; in part by the Six Talent Peaks Project in Jiangsu Province under Grant R2017L07; in part by the NSFC under Grant 61773002; and in part by the Applied Basic Research Project in Shanxi Province under Grant 201601D011007.

References (61)

  • X.-F. Wang et al., A novel level set method for image segmentation by incorporating local statistical analysis and global similarity measurement, Pattern Recognit. (2015).
  • P. Ochs et al., Segmentation of moving objects by long term video analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2014).
  • A. Lukežič et al., Discriminative correlation filter with channel and spatial reliability, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Mueller et al., Context-aware correlation filter tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Danelljan et al., Learning spatially regularized correlation filters for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • J.F. Henriques et al., High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • L. Bertinetto et al., Staple: complementary learners for real-time tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • H. Kiani Galoogahi et al., Learning background-aware correlation filters for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Danelljan et al., Beyond correlation filters: learning continuous convolution operators for visual tracking, in: Proceedings of the European Conference on Computer Vision, 2016.
  • M. Danelljan et al., ECO: efficient convolution operators for tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • H. Kiani Galoogahi et al., Correlation filters with limited boundaries, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • A. Papazoglou et al., Fast object segmentation in unconstrained video, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.
  • N. Märki et al., Bilateral space video segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • T. Wang et al., Probabilistic motion diffusion of labeling priors for coherent video segmentation, IEEE Trans. Multimedia (2012).
  • C. Rother et al., GrabCut: interactive foreground extraction using iterated graph cuts, ACM Trans. Graph. (2004).
  • Y. Wu et al., Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • D. Tsai et al., Motion coherent tracking using multi-label MRF optimization, Int. J. Comput. Vis. (2012).
  • A. Prest et al., Learning object class detectors from weakly annotated video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • B.L. Price et al., LiveCut: learning-based interactive video segmentation by evaluation of multiple propagated cues, in: Proceedings of the IEEE International Conference on Computer Vision, 2009.
  • X. Bai et al., Video SnapCut: robust video object cutout using localized classifiers, ACM Trans. Graph. (2009).