
Image and Vision Computing

Volume 31, Issues 6–7, June–July 2013, Pages 434-447

Markov Chain Monte Carlo Modular Ensemble Tracking

https://doi.org/10.1016/j.imavis.2012.09.007

Abstract

Recent years have been characterized by the rapid growth of video-surveillance systems and by the automation of the processing they integrate. Object tracking has become a recurrent problem in video-surveillance and is an important domain of computer vision. It has recently been approached with classification techniques and, more recently still, with boosting methods.

We propose in this paper a new machine-learning-based strategy to build the observation model of tracking systems. The global observation function results from a linear combination of several simpler observation functions, so-called modules (one per visual cue). Each module is built using an Adaboost-like algorithm derived from the Ensemble Tracking algorithm. The importance of each module is estimated using an original probabilistic sequential filtering framework with a joint state model composed of both the spatial object parameters and the importance parameters of the observation modules.

Our system is tested on challenging sequences that demonstrate its tracking and scaling performance on both fixed and mobile cameras, and we compare the robustness of our algorithm with the state of the art.


Highlights

► New machine learning based strategy to build observation model of tracking systems.
► Classifiers trained with Adaboost on homogeneous feature spaces.
► Original probabilistic sequential filtering framework with a joint state model.

Introduction

Numerous works identify object tracking as a critical issue in many applications such as surveillance and anomaly detection [20]. Among all definitions, [36] defines tracking as the estimation, from a video sequence, of the trajectory of moving objects in the image plane. Tracking itself is composed of two steps: an object detection step, where potential candidates are identified in each frame of the sequence, and a tracking step, where a specific candidate is followed across the frames. Depending on the constraints imposed on these two steps, several methods and algorithms are available (see e.g. [36], [27] for a review). Four main categories can be distinguished: background subtraction, silhouette tracking, point tracking and supervised learning methods. We impose four constraints on the tracker: it must be (i) robust; (ii) real-time; (iii) able to track pedestrians; and (iv) usable with mobile camera acquisitions. Due to these constraints, background subtraction and silhouette tracking are inappropriate, and we focus in the following on the two remaining categories.

Several tracking problems (point tracking, parametric contour tracking) bring together two methods widely used in the vision community: Kalman filters (e.g. [7]) and particle filters. In the following, we are more particularly interested in point tracking, in which the object is represented by a few points. One limitation of the Kalman filter is the assumption that the state variables are normally distributed. To overcome the shortcomings of Gaussian modeling in describing dynamics and observations, the condensation algorithm was introduced by Isard and Blake [21], who developed a method to track curves in dense visual clutter in near real-time. This limitation can more generally be overcome by using particle filtering. [8] used classifiers as the likelihood observation function of a particle-filter-based tracker. [29] used a particle-filter-based approach to detect and track multiple targets. Brasnett et al. [6] designed a particle filter for object tracking based on color, edge and texture cues with adaptive parameters, together with an online estimation method for the noise parameters of the visual models.

Particle filters are thus very efficient methods to track multiple objects, as they can cope with non-linearities and multi-modalities induced by occlusions and background clutter.
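
To make the role of the particle filter concrete, the following is a minimal sketch of a generic SIR (bootstrap) particle filter of the kind cited above. The random-walk dynamics, the placeholder likelihood argument and the parameter names are illustrative assumptions, not the motion or observation models used in this paper.

```python
import numpy as np

def sir_step(particles, weights, motion_noise, likelihood, measurement, rng):
    """One SIR (bootstrap) particle filter step: predict, weight, resample.

    particles  : (N, d) array of state hypotheses (e.g. x, y, width, height)
    likelihood : callable returning p(measurement | state) for one particle
    """
    n = len(particles)
    # Predict: propagate each particle with random-walk dynamics (placeholder).
    particles = particles + rng.normal(scale=motion_noise, size=particles.shape)
    # Update: re-weight each particle by the observation likelihood.
    weights = weights * np.array([likelihood(measurement, p) for p in particles])
    weights = weights / weights.sum()
    # Resample (multinomial) to fight weight degeneracy.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```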

Supervised learning consists in inferring a function from supervised training data, composed of input data and desired outputs (here class labels). The task of the supervised learner is to predict the class label (classification) of unknown data using only a given number of training samples. In the tracking community, the most popular supervised learning methods include the direct construction of an inter-class frontier (e.g. Support Vector Machines [2], [35], Neural Networks [5]) and the combination of classifiers to improve classification performance. In this sense, several works have already described semi-supervised or online tracking based on machine learning algorithms such as boosting. In particular, Collins et al. [9] present an online feature selection method that evaluates multiple features while tracking and adjusts the set of features to improve tracking performance. Features are ranked by computing the object/background variance ratio of log-likelihood distributions. This selection operates on a finite set of features (for example 49 linear combinations of the RGB color space). The feature evaluation method is embedded in a mean-shift tracking system. More generally, boosting, and especially Adaboost [12], was proved to be very efficient [16], [22].
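
A minimal sketch of the variance-ratio ranking of [9] mentioned above, as we read it: a feature is scored by how tightly the log-likelihood ratio of its object and background histograms clusters within each class relative to the separation between classes. The smoothing constant and histogram handling are illustrative assumptions.

```python
import numpy as np

def variance_ratio(p_obj, q_bg, delta=1e-3):
    """Score a feature by the variance ratio of its log-likelihood ratio [9].

    p_obj, q_bg : normalized histograms of the feature over object and
    background pixels; a large score means good object/background separation.
    """
    log_lr = np.log(np.maximum(p_obj, delta) / np.maximum(q_bg, delta))

    def var(hist):
        # Variance of log_lr under the distribution `hist`.
        mean = np.sum(hist * log_lr)
        return np.sum(hist * log_lr ** 2) - mean ** 2

    # Between-class spread divided by within-class spread.
    return var(0.5 * (p_obj + q_bg)) / (var(p_obj) + var(q_bg) + 1e-12)
```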

This paper follows the same idea but does not claim a contribution on module learning or updating. The main contributions relate to the joint exploration of the spatial configuration of the object and of the relevance of the observation modules.

In a preliminary work [30], we proposed to train classifiers with Adaboost on homogeneous feature spaces. Individual classification decisions (confidence maps) were then combined into a global decision (one unique map and one unique position) via a heuristic algorithm, and this decision was finally used to update all strong classifiers. As only one decision was kept, it might not always be accurate (error accumulation, bimodal set of maps, …). We propose here to adopt a dual point of view, combining point tracking and supervised learning methods. More precisely, the classification decisions of the classifiers trained on homogeneous feature spaces are used by a particle filter specially designed to track the object position and, at the same time, the best combination of classification decisions.

Several recent works [3], [34], [17], [25], [28] deal with machine-learning-based strategies to build the observation model of tracking systems. One classic strategy provides a discriminative function between the object and the background (in a region close to the object). However, most of these approaches use a unique visual feature (color based, edge based, …) to describe the object appearance; since both the object and background appearance may vary with time, a good discriminative function should be able to handle heterogeneous visual cues. In [3], a heterogeneous feature vector is used to describe the object/background appearance model in an Adaboost-based framework. The drawback of this strategy is that some very different cues (edges, bins of color histograms, …) are merged in the same geometric subspace. Some cues may be unreliable, and may therefore lead to a high global Bayesian error.

We propose an alternative strategy that consists in building a global observation function resulting from a linear combination of several simpler observation functions, so-called modules (one per visual cue). The importance of each module (i.e. its weight in the linear combination) is estimated according to the object/background configuration, and we propose an original probabilistic sequential filtering framework with a joint state model composed of both the spatial object parameters and the importance parameters of the observation modules. At each time step, the posterior distribution of the state vector is estimated with a Monte Carlo algorithm. The SIR algorithm [15] is a very popular solution for sequential filtering in computer vision. However, since the number of particles grows exponentially with the dimension of the state vector, this solution cannot be used in high-dimensional spaces for real-time applications. We therefore propose a Markov Chain Monte Carlo particle filter with a pseudo-marginal proposal strategy, which is more efficient and has already been used in computer vision for multi-object tracking [24].
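
The following is a schematic Metropolis-Hastings sampler over the joint state (spatial parameters plus module weights) that illustrates the blockwise, marginal-style proposal idea. It is only a sketch: the proposal scales, the placeholder likelihood and the omission of the dynamics (prior) term are our own simplifications, and the weight move is treated as if it were symmetric, which is an approximation.

```python
import numpy as np

def mcmc_filter_step(prev_samples, likelihood, n_samples, rng,
                     sigma_spatial=2.0, sigma_weights=0.05):
    """One MCMC particle-filter step over the joint state (illustrative).

    Each sample is (spatial, weights): spatial = (x, y, w, h) and weights are
    the module importances (positive, summing to 1). Each move perturbs only
    one block, so the chain never explores the full joint space at once.
    """
    # Start the chain from a sample of the previous posterior approximation.
    spatial, weights = prev_samples[rng.integers(len(prev_samples))]
    current_lik = likelihood(spatial, weights)
    samples = []
    for _ in range(n_samples):
        if rng.random() < 0.5:                       # perturb the spatial block
            prop_spatial = spatial + rng.normal(scale=sigma_spatial, size=4)
            prop_weights = weights
        else:                                        # perturb the weight block
            prop_weights = np.abs(weights + rng.normal(scale=sigma_weights,
                                                       size=len(weights)))
            prop_weights = prop_weights / prop_weights.sum()
            prop_spatial = spatial
        prop_lik = likelihood(prop_spatial, prop_weights)
        # Metropolis acceptance (dynamics term omitted, proposal taken as symmetric).
        if rng.random() < min(1.0, prop_lik / max(current_lik, 1e-12)):
            spatial, weights, current_lik = prop_spatial, prop_weights, prop_lik
        samples.append((spatial.copy(), weights.copy()))
    return samples
```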

The work most closely related to ours may be that of [13], which extends semi-supervised online boosting [17] with a particle filter to achieve a higher frame rate. The speed-up comes from a more sophisticated search-space sampling and an improved update sample selection. Other related works include [14], where the authors used a Bayesian model integrating a semi-supervised likelihood function, a sequential Monte Carlo scheme for efficient online Bayesian updating, and a posterior-reduction criterion for active learning.

The remainder of this paper is organized as follows: the next subsection recalls some basics of the Ensemble Tracking algorithm, since our supervised learners are partially based on [3]. Section 2 introduces our contributions, consisting first of an aggregated supervised tracking step performed on homogeneous feature spaces, and then of a Markov Chain Monte Carlo (MCMC) particle filter estimating both the position and dimensions of the object to track and the weights of the classifiers stemming from the tracking step. Finally, Section 3 presents and analyzes the results of our algorithm on both synthetic and real challenging video sequences.

Given a video sequence (i.e. a set of frames I1, …, Ip) and labeled samples of object/background pixels on the first frame, Ensemble Tracking (ET, [3]) first trains a set of weak classifiers. A strong classifier is then computed via the Adaboost algorithm; it classifies the pixels of the next frame and builds a confidence map, which is further analyzed by the mean-shift algorithm [10] to determine the new object position. The strong classifier is finally updated, integrating data from the current frame. Each strong classifier is built from a set of weak classifiers handling a single feature vector in ℝ^11, made up of an 8-bin local histogram of oriented gradients (HoG) and the pixel R, G, B values. An overview of ET is presented in Algorithm 1.
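
As an illustration of the feature vector described above, the sketch below assembles, for one pixel, an 8-bin local HoG plus the R, G, B values. The neighbourhood size and the gradient computation are illustrative assumptions, not necessarily those of [3].

```python
import numpy as np

def pixel_feature(image, y, x, half=2, n_bins=8):
    """Build an 11-D per-pixel feature: 8-bin local HoG + R, G, B values.

    image : (H, W, 3) array; (y, x) is assumed to lie at least half + 1
    pixels away from the border. The window size is an illustrative choice.
    """
    gray = image.mean(axis=2)
    # Take a slightly larger patch so the gradient is valid on the interior.
    patch = gray[y - half - 1:y + half + 2, x - half - 1:x + half + 2]
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)[1:-1, 1:-1]
    ang = np.mod(np.arctan2(gy, gx)[1:-1, 1:-1], np.pi)  # unsigned orientation
    hog, _ = np.histogram(ang, bins=n_bins, range=(0.0, np.pi), weights=mag)
    hog = hog / (hog.sum() + 1e-12)
    return np.concatenate([hog, image[y, x, :].astype(float)])
```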

The main drawback of ET is that it performs tracking in a heterogeneous feature space. The features used in the combined feature vector may be unreliable, and therefore may lead to a high global Bayesian error. Furthermore, features that are not discriminative enough may hinder the classification results.

To avoid these problems, we propose to work on several homogeneous feature spaces and to track the object with an ET-like algorithm on each of these spaces (called modules; one confidence map per space, based on a consistent feature vector). The decisions are then combined into a single one, managing their complementarity, reliability and redundancy. Using one ET strong classifier per space allows an independent decision to be taken on each homogeneous feature space and therefore makes it possible to handle non-discriminative data that could hinder the final decision stage.
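
In its simplest form, this combination step can be sketched as follows: each module outputs a confidence map over the search region, the maps are fused linearly with the module importances, and a candidate box is scored on the fused map. This is a sketch under our own assumptions (mean confidence as the box score), not the exact fusion rule of the paper.

```python
import numpy as np

def fuse_confidence_maps(maps, importances):
    """Fuse per-module confidence maps into one global observation map.

    maps        : list of (H, W) arrays, one per homogeneous feature space
    importances : module weights estimated by the filter (renormalized here)
    """
    importances = np.asarray(importances, dtype=float)
    importances = importances / importances.sum()
    return sum(w * m for w, m in zip(importances, maps))

def candidate_score(global_map, x, y, w, h):
    """Score a candidate box by its mean fused confidence (illustrative)."""
    region = global_map[int(y):int(y + h), int(x):int(x + w)]
    return float(region.mean()) if region.size else 0.0
```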

Splitting the feature space strongly modifies the objective of the tracking process: the tracking algorithm now has to estimate a hidden state composed, on the one hand, of the position and dimensions of the object and, on the other hand, of the linear weights of the module decisions leading to the most discriminant observation. To handle this modification, we propose a specific particle filter jointly managing the position and dimensions of the object and the weights of the modules.
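
One possible, purely illustrative encoding of this joint hidden state is sketched below; the field names are our own.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class JointState:
    """Joint hidden state: object box plus the importance of each module."""
    x: float                  # top-left corner, horizontal
    y: float                  # top-left corner, vertical
    w: float                  # box width
    h: float                  # box height
    importances: np.ndarray   # one weight per observation module, sums to 1

    def normalized(self):
        """Return a copy with the module weights renormalized."""
        imp = np.abs(self.importances)
        return JointState(self.x, self.y, self.w, self.h, imp / imp.sum())
```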

Algorithm 1 Ensemble Tracking


Markov Chain Monte Carlo Modular Ensemble Tracking

This section presents the proposed exploration algorithm, which approximates a joint appearance/position state of the object to be tracked. A Markov Chain Monte Carlo particle filter is used to efficiently explore the state space using a marginal proposal law. The synoptic diagram of the Markov Chain Monte Carlo Modular Ensemble Tracking (MC2-MET) is shown in Fig. 1, and the whole algorithm is summarized in Appendix A. Each step is detailed in the next subsections.

The feature space is now composed of

Results

This section presents the experiments conducted to evaluate the proposed method on both synthetic and real challenging sequences.

The Markov Chain Monte Carlo Modular Ensemble Tracking algorithm (MC2-MET) has been implemented in C++ on a PC equipped with an Intel® Core 2 Duo E8500 at 3.16 GHz and 4 GB of DDR2 RAM.

Conclusion and perspectives

We presented in this article a modular version of Ensemble Tracking combined with a Markov Chain Monte Carlo (MCMC) particle filter. The key idea was to jointly track the object position/scale and the relevance of each observation module with a sequential Bayesian filter. We introduced our first contribution, a new technique of data separation into homogeneous, consistent feature spaces, and a new update method for the algorithm, both leading to more robust and more stable tracking, and a new

References (36)

  • R.T. Collins et al.

    Online selection of discriminative tracking features

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • D. Comaniciu et al.

    Mean shift: a robust approach toward feature space analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • J. Deutscher et al.

    Articulated Body Motion Capture by Annealed Particle Filtering

  • Y. Freund et al.

    Experiments with a New Boosting Algorithm

  • M. Godec et al.

    Speeding Up Semi-supervised On-line Boosting for Tracking

  • A.B. Goldberg et al.

    Oasis: Online Active Semi-supervised Learning

  • N.J. Gordon et al.

    Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation

  • H. Grabner et al.

    Real-time Tracking Via On-line Boosting

    This paper has been recommended for acceptance by Matti Pietikainen.
