Independent increment processes for human motion recognition

https://doi.org/10.1016/j.cviu.2007.02.002

Abstract

This paper describes an algorithm for classifying human motion patterns (trajectories) observed in video sequences. We address this task in a hierarchical way: high-level activities are described as sequences of low-level motion patterns (dynamic models). These low-level dynamic models are simply independent increment processes, each describing a specific motion regime (e.g., “moving left”). Classifying a trajectory thus consists in segmenting it into the sequence of its low-level components; each sequence of low-level components corresponds to a high-level activity. To perform the segmentation, we introduce a penalized maximum-likelihood criterion which is able to select the number of segments via a novel MDL-type penalty. Experiments with synthetic and real data illustrate the effectiveness of the proposed approach.

Introduction

The analysis of human activities from video sequences is an active research topic with obvious applications in video surveillance, e.g., in the detection of typical or abnormal behaviors. This paper focuses on the classification of human motion in real situations using surveillance cameras.

The surveillance of large public areas often relies on a large number of cameras covering several regions of interest. The observation room is usually equipped with a large set of video monitors, used by (one or more) human operator(s) to watch over multiple areas. This requires considerable human effort in multiplexing attention across multiple scenes and events. Recently, considerable effort has been devoted to developing automatic surveillance systems that provide information about which activities are taking place in a given space. These systems aim at monitoring the actions of each pedestrian, in order to classify their activity and discriminate common activities (e.g., “walking”, “entering shop”) from dangerous or inappropriate ones (e.g., “running fast”, “fighting”).

In this work, activities are recognized from the motion patterns associated with each person tracked by the system. A fundamental assumption is that the motion is described by a sequence of displacements of the 2D centroid (mean position) of each person’s image “blob”. These displacements are described by multiple dynamical models associated with elementary motion regimes, equipped with a switching mechanism. Switching among different models occurs at unknown time instants, which have to be estimated from the video sequence.
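Under this assumption, the displacement sequence fed to the low-level models is obtained by first-order differencing of the tracked centroid positions. A minimal sketch (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def centroid_displacements(centroids):
    """Convert a sequence of 2D blob centroids c_0, ..., c_T into the
    displacement sequence x_t = c_t - c_{t-1} that the low-level
    dynamical models describe."""
    c = np.asarray(centroids, dtype=float)   # shape (T+1, 2)
    return np.diff(c, axis=0)                # shape (T, 2)
```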

Our approach is thus of a hierarchical nature. Each high-level activity is composed of a sequence of low-level models; for example, “browsing” may be composed of “moving left”–“stopped”–“moving right”. Classifying a trajectory thus consists in segmenting it into the sequence of its low-level components. A sequence of low-level components can then be classified into one of the considered high-level activities via a simple look-up table. The central block of the procedure is clearly the segmentation of each observed trajectory into its low-level components. To this end, we introduce a penalized maximum-likelihood segmentation/classification criterion, equipped with an MDL-type penalty which is able to select the number of segments. This segmentation is based on the parameters of each low-level model, which are learned off-line during a training phase, from hand-segmented data.

The rest of the paper is organized as follows. Section 2 overviews related work. Section 3 presents our formulation of the problem and the adopted motion models. In Section 4, we describe the proposed methodology, namely the model parameter estimation criteria and the trajectory segmentation algorithm. Section 5 describes how the high-level classes are assigned to sequences of low-level models. Section 6 reports experimental results with synthetic data and real video sequences. Section 7 discusses some limitations of the approach and ongoing work to address them. Finally, Section 8 concludes the paper.

Section snippets

Related Work

A considerable amount of work has been recently done to characterize human activities and behavior from video sequences (a recent survey can be found in [5]). This task is usually split into two parts: object tracking and activity recognition [5]. Different types of methods have been used to deal with each of these operations. Object tracking is often performed using deformable models [2], [8], region based techniques (e.g., blob detection [25]), articulated models of the human body [26], or

Rationale

In order to segment and classify different activities, as mentioned in Section 1, we first observed that all trajectories of a given activity follow a typical route. Fig. 1 shows trajectories corresponding to a person entering a shop (left), leaving a shop (right) or just passing in front of a shop (bottom).

In this work, we use sequences of elementary models, such as “moving upwards”, “stopped”, “moving downwards”, “moving left”, and “moving right” to describe the trajectories. Our problem

Model parameter estimation

The estimation of the parameters of the low-level models (that is, of {μ_c, Q_c, c = 1, …, C}) is performed in a supervised way, using training trajectories which were previously segmented into low-level actions by a human observer. These estimates are simply standard maximum-likelihood estimates, based on the set of all displacements in the training trajectories that were classified as belonging to each model.
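Since the estimates are standard maximum-likelihood estimates from hand-segmented displacements, they reduce to per-class sample means and (biased) sample covariances. A sketch, with hypothetical names:

```python
import numpy as np

def estimate_model_params(displacements, labels, C):
    """ML estimates for each low-level model c: the sample mean mu_c and
    sample covariance Q_c of all training displacements labelled c."""
    x = np.asarray(displacements, dtype=float)
    lab = np.asarray(labels)
    mus, Qs = [], []
    for c in range(C):
        xc = x[lab == c]
        mus.append(xc.mean(axis=0))
        Qs.append(np.cov(xc.T, bias=True))   # bias=True gives the ML estimate
    return mus, Qs
```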

Segmentation into a known number of segments

Having defined the set of models and learnt the corresponding parameters from training

High-level identification of the sequences

To identify the performed activities, we must assign semantics to each sequence of model labels into which a trajectory is segmented. For example, in the case of classification of people moving in a shopping mall (described in detail in the next section), an “entering” activity is identified whenever the sequence of low-level segment labels “moving right”–“moving up” or “moving left”–“moving up” occurs within the trajectory; as another example, a “leaving” activity is identified when the
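The look-up can be sketched as a small table mapping label patterns to activities. The “entering” patterns below are taken from the text; the “leaving” patterns (truncated in this snippet) and the fallback class are purely illustrative:

```python
# Hypothetical look-up table: an activity is recognized when one of its
# label patterns occurs as a contiguous subsequence of the segment labels.
ACTIVITY_PATTERNS = {
    "entering": [("moving right", "moving up"), ("moving left", "moving up")],
    # illustrative only -- the source truncates the "leaving" definition
    "leaving":  [("moving down", "moving right"), ("moving down", "moving left")],
}

def classify_sequence(labels):
    """Map a sequence of low-level model labels to a high-level activity."""
    labels = tuple(labels)
    for activity, patterns in ACTIVITY_PATTERNS.items():
        for p in patterns:
            n = len(p)
            if any(labels[i:i + n] == p for i in range(len(labels) - n + 1)):
                return activity
    return "passing"   # illustrative default class
```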

Synthetic data

This subsection presents experimental results with synthetic data, for which we have performed Monte Carlo tests. We have considered the following five (C = 5) low-level models: {“moving right”, “moving left”, “moving up”, “moving down”, “stopped”}. These low-level models are characterized by the following parameters:

moving right: μ₁ = (3, 0)ᵀ, Q₁ = diag(3, 1);
moving left: μ₂ = (−3, 0)ᵀ, Q₂ = diag(3, 1);
moving up: μ₃ = (0, 2)ᵀ, Q₃ = diag(1, 4);
moving down: μ₄ = (0, −2)ᵀ, Q₄ = diag(1, 4);
stopped: μ₅ = (0, 0)ᵀ, Q₅ = diag(0.5, 0.5).

Based on these five low-level models, we
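Synthetic trajectories of this kind can be generated by sampling i.i.d. Gaussian displacements from each regime and integrating them (an independent increment process). The parameters below are the five models listed above; the function and regime names are illustrative:

```python
import numpy as np

# The five low-level models from the synthetic experiment.
MUS = {"right": [3, 0], "left": [-3, 0], "up": [0, 2],
       "down": [0, -2], "stopped": [0, 0]}
QS = {"right": [[3, 0], [0, 1]], "left": [[3, 0], [0, 1]],
      "up": [[1, 0], [0, 4]], "down": [[1, 0], [0, 4]],
      "stopped": [[0.5, 0], [0, 0.5]]}

def sample_trajectory(regimes, rng=None):
    """Draw i.i.d. Gaussian displacements for each (regime, length) pair
    and integrate them into a 2D trajectory."""
    rng = np.random.default_rng(rng)
    steps = [rng.multivariate_normal(MUS[m], QS[m], size=n)
             for m, n in regimes]
    return np.cumsum(np.vstack(steps), axis=0)
```

For example, `sample_trajectory([("right", 20), ("stopped", 10), ("up", 20)])` produces a 50-point trajectory that first drifts right, then pauses, then moves up.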

Limitations and ongoing work

In this section, we discuss some of the limitations of the proposed approach. First, it is clear that the method strongly relies on the assumption of a static environment and a static camera, conditions which are not always satisfied. For example, in the campus experiment, the number of parked cars can change drastically during the day; in the presence of a large number of cars, the trajectories are more constrained, and this may have a relevant effect on the statistical parameters of

Conclusions

In this paper, we have proposed and tested an algorithm for modelling, segmenting, and classifying human trajectories in constrained environments. The proposed approach describes high-level activities as sequences of low-level motion models. We have introduced a penalized maximum-likelihood criterion to segment the observed trajectories into their low-level components; the criterion is able to select the number of segments, due to the presence of a novel MDL-type penalty. Our experimental results

Acknowledgment

We thank Prof. José Santos-Victor (ISR/IST) and the other members of the CAVIAR project for many stimulating discussions.

References (25)

  • D. Ayers et al., Monitoring human behavior from video taken in an office environment, Image and Vision Computing (2001)
  • J. Marques et al., Optimal and suboptimal shape tracking based on switched dynamic models, Image and Vision Computing (2001)
  • O. Masoud et al., A method for human action recognition, Image and Vision Computing (2003)
  • Y. Yacoob et al., Parameterized modeling and recognition of activities, Computer Vision and Image Understanding (1999)
  • A. Baumberg et al., Learning deformable models for tracking the human body
  • A. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
  • S. Hongeng, R. Nevatia, Multi-agent event recognition, in: Proceedings of the 8th IEEE International Conference on...
  • W. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews (2004)
  • M. Isard, A. Blake, A mixed-state condensation tracker with automatic model-switching, in: Proceedings of the...
  • Y. Ivanov et al., Recognition of visual activities and interactions by stochastic parsing, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
  • J. Nascimento et al., Robust shape tracking in the presence of cluttered background, IEEE Transactions on Multimedia (2004)
  • N. Johnson et al., Learning the distribution of object trajectories for event recognition, Image and Vision Computing (1996)
This work was partially supported by Fundação para a Ciência e a Tecnologia, Portuguese Ministry of Science and Technology and Higher Education (which includes FEDER funds) and the EU IST Program, under Project IST-2001-37540 (CAVIAR).
