Silhouette-based gesture and action recognition via modeling trajectories on Riemannian shape manifolds

https://doi.org/10.1016/j.cviu.2010.10.006

Abstract

This paper addresses the problem of recognizing human gestures from videos using models that are built from the Riemannian geometry of shape spaces. We represent a human gesture as a temporal sequence of human poses, each characterized by a contour of the associated human silhouette. The shape of a contour is viewed as a point on the shape space of closed curves and, hence, each gesture is characterized and modeled as a trajectory on this shape space. We propose two approaches for modeling these trajectories. In the first, template-based approach, we use dynamic time warping (DTW) to align the different trajectories using elastic geodesic distances on the shape space. The gesture templates are then calculated by averaging the aligned trajectories. In the second approach, we use a graphical model similar to an exemplar-based hidden Markov model: we cluster the gesture shapes on the shape space and build non-parametric statistical models to capture the variations within each cluster. We then model each gesture as a Markov model of transitions between these clusters. To evaluate the proposed approaches, an extensive set of experiments was performed using two different data sets representing gesture and action recognition applications. The proposed approaches are not only able to represent the shape and dynamics of the different classes for recognition, but are also robust against errors resulting from segmentation and background subtraction.

Research highlights

  • Gesture recognition using models built from the Riemannian geometry of shape spaces.
  • Silhouette contours are represented as points on the shape manifold of closed curves.
  • Template-based and graphical-model-based approaches are used to model the gesture dynamics.
  • Experimental validation is performed using two different datasets of human gestures.

Introduction

The problems of modeling and recognizing human gestures and actions from a sequence of images have received considerable interest in the computer vision community, as part of the broader goal of achieving a high-level automated understanding of video data. This interest is motivated by applications in areas such as human–computer interaction [1], robotics [2], security, and multimedia analysis.

Existing literature in human movement visual analysis (see reviews [3], [4], [5], [6]) uses the terms gesture and action interchangeably to refer to sequences of human poses corresponding to different activities. However, there are slight distinctions between these terms. Gesture recognition from video refers to the problem of modeling and recognizing full-body gestures performed by an individual in the form of a sequence of body poses and captured by a video camera. These gestures are typically used to communicate control commands and requests to a machine equipped with vision capabilities, a scenario that usually arises in applications such as human–computer interaction and robotics. On the other hand, action recognition refers to the more general case of modeling and recognizing different human actions, such as walking, running, and jumping, performed under different scenarios and conditions. This problem is more prominent in applications such as smart surveillance and media indexing. The main difference between the two problems is that in the former the human subject can be more cooperative, which reduces the need for building view invariance into the models. The models proposed in this paper can be used for either problem, as demonstrated by our use of two different experimental datasets for gesture and action recognition, respectively. We will use the term gesture to refer to both gestures and simple actions throughout the paper.

Many of the existing approaches for gesture recognition model gestures as a temporal sequence of feature points representing the human pose at each time instant. The choice of these features usually depends on the application domain, image quality or resolution, and computational constraints. Features such as exemplar key frames [7], optical flow [8], and feature point trajectories [9], [10] have frequently been used to represent the raw, high-dimensional video data. The major challenge with most of these features is that they require highly accurate low-level processing, such as tracking of interest points. This accuracy turns out to be very hard to achieve in gesture recognition scenarios because of fast articulation, self-occlusion, and the varying resolution levels encountered in different applications.

In order to overcome this limitation, silhouette-based approaches have recently been receiving increasing attention [11], [12], [13]. These approaches use the shape of the binary silhouette of the human body as a feature for gesture recognition. They rely on the observation that most human gestures can be recognized using only the shape of the outer contour of the body, as shown in Fig. 1 for a “Turn Right” control gesture. The most important advantage of these features is that they are easy to extract from the raw video frames using object localization and background subtraction algorithms, low-level processing tasks for which relatively high accuracy can be achieved under a variety of conditions.

An important question in silhouette-based approaches is: how can we represent the shape of these silhouettes in an efficient and robust way? Several shape representation features have been used in the literature for this purpose, including chain codes [14], Fourier descriptors [15], shape moments [16], and shape context [17]. For most of these features, the feature vector is treated as a vector in a Euclidean space in order to use standard vector space methods for modeling and recognition. This assumption is not usually valid, as these features lie in a low-dimensional, non-Euclidean space. Working directly on these nonlinear manifolds can provide models and discriminative measures that may result in improved performance. One way to explore this lower-dimensional space is to learn its structure from training data using dimensionality reduction techniques combined with a suitable notion of a local discriminative measure between the visual data features. This technique was recently used [18], [19], [13] for human action recognition and pose recovery. Its problems stem from the limitations of data-driven manifolds, such as the lack of robust statistical models and the difficulty of extrapolating and matching new data.

The limitations of data-driven manifold methods noted above have shifted the attention of many computer vision researchers towards the use of analytic differential geometry. This shift was also supported by the fact that many features in computer vision lie on curved spaces because of the geometric nature of the problems. Several of these manifolds have been used in problems such as object detection and tracking [20], affine-invariant shape clustering [21], and activity modeling [22]. The use of such manifolds offers a wide variety of statistical and modeling tools that arise from the field of differential geometry. These tools have found applications in problems such as target recognition [23], parameter estimation [24], clustering and dimensionality reduction [25], classification [26], and statistical analysis [27], [22].

The choice of the right feature and space to model the shape of the silhouettes is not the only issue in silhouette-based methods. An equally important problem is the efficient modeling of the dynamics of temporal variations of these features as the gesture progresses. The importance of both shape and dynamic cues for modeling human movement was noted and demonstrated experimentally in [28]. Various models have been used for modeling these dynamics, ranging from statistical generative models [29], [30], [31], [32] to the more recent discriminative models [33], [34]. Invariance to the temporal rate of execution of an action is crucial in such models for achieving accurate recognition [35].

In this paper, we explore the use of shape analysis on manifolds for human actions and gesture recognition. Our approach falls into the category of the silhouette-based approaches described earlier. Each silhouette is represented by a planar closed curve corresponding to the contour of this silhouette, and we are interested in evolving shapes of these curves during actions and gestures. We will use a recent approach for shape analysis [36], [37], [38], that uses differential geometric tools on the shape spaces of closed curves. Similar ideas have also been presented in [39], [40], [41]. While there are several ways to analyze shapes of closed curves, an elastic analysis of the parameterized curves is particularly appropriate in this application. This is because: (1) the elastic matching of curves allows nonlinear registration and improved matching of features (e.g. body parts) across silhouettes, (2) this method uses a square-root representation under which the elastic metric reduces to the standard L2 metric and thus simplifies the analysis, and (3) under this metric the re-parameterizations of curves do not change Riemannian distances between them and thus help remove the parametrization variability from the analysis. Furthermore, such geometric approaches are useful because they allow us to perform intrinsic statistical analysis tasks, such as shape modeling and clustering, on such Riemannian spaces [42].

Using a square-root representation of contours, each human gesture is transferred into a sequence of points on the shape space of closed curves. Thus, the problem of action recognition becomes a problem of modeling and comparing dynamical trajectories on the shape space. We propose two different approaches to model these trajectories.
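As an illustrative sketch (not the authors' implementation), the square-root velocity map q(t) = c′(t)/√|c′(t)| that underlies this representation can be computed for a discretely sampled contour as follows; the function name, the finite-difference scheme, and the normalization convention are our own choices:

```python
import numpy as np

def srv_representation(curve):
    """Map a sampled planar curve (n x 2 array) to its square-root
    velocity (SRV) function q(t) = c'(t) / sqrt(|c'(t)|)."""
    # Finite-difference derivative; note np.gradient uses one-sided
    # differences at the endpoints, an approximation for closed curves.
    deriv = np.gradient(curve, axis=0)
    speed = np.maximum(np.linalg.norm(deriv, axis=1), 1e-8)
    q = deriv / np.sqrt(speed)[:, None]
    # Rescale so the discrete L2 norm of q is 1, removing curve scale.
    q /= np.sqrt(np.mean(np.sum(q ** 2, axis=1)))
    return q
```

Because translation is removed by differentiation and scale by the normalization, two contours differing only in position and size map to the same q, consistent with the invariances described above.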

In the first approach, we propose a template-based method to learn a unique template trajectory representing each gesture. One of the main challenges in template-based methods is accounting for variations in temporal execution rate. To deal with this problem, we use a modified version of the Dynamic Time Warping (DTW) algorithm to learn the warping functions between the different realizations of each gesture, matching points on the trajectories using geodesic distances on the shape space. An iterative approach is then used to learn a mean trajectory on the shape space and to compute the temporal warping functions.
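A minimal sketch of DTW with a pluggable pointwise distance, which could stand in for the geodesic distance on the shape space, may help make this concrete; the paper's modified DTW and iterative template averaging are not reproduced here, and the function names are illustrative:

```python
import numpy as np

def dtw_align(traj_a, traj_b, dist):
    """Dynamic time warping between two trajectories (sequences of
    points), using an arbitrary pointwise distance `dist`, e.g. a
    geodesic distance on a shape space."""
    m, n = len(traj_a), len(traj_b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(traj_a[i - 1], traj_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the warping path as index pairs.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[m, n], path[::-1]
```

The recovered path plays the role of the warping function between two realizations; averaging aligned trajectories would then proceed along this path.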

In the second approach, we utilize the geometry of the shape space more fully in order to cope with the variations within each gesture caused by changes in execution style, body shape, and noise. Each gesture is modeled as a Markov model representing transitions among different clusters on the shape space of closed curves. We learn these models by decoupling the problem into two stages. In the first stage, we cluster the individual silhouette shapes using the Affinity Propagation (AP) clustering technique [43] and build a statistical model of the variation within each cluster. In the second stage, a Hidden Markov Model (HMM) is used to learn the transitions between clusters for each gesture.
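The second-stage transition learning can be illustrated by estimating a Markov transition matrix from cluster-label sequences; this is a simplified sketch (maximum-likelihood counts with additive smoothing), not the full exemplar-based HMM, and the function name is our own:

```python
import numpy as np

def transition_matrix(label_sequences, n_clusters):
    """Estimate a Markov transition matrix among shape clusters from
    the cluster-label sequences of a gesture's training realizations."""
    counts = np.zeros((n_clusters, n_clusters))
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    counts += 1e-6  # additive smoothing to avoid all-zero rows
    # Normalize each row so it is a valid probability distribution.
    return counts / counts.sum(axis=1, keepdims=True)
```

At recognition time, a test sequence of cluster labels would be scored against each gesture's matrix, with the highest-likelihood model giving the predicted class.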

Extensive experiments were conducted to test the performance of our algorithms. We used two different data sets of video sequences representing different control gestures and regular actions, with a total of 226 video sequences. The data sets contained many variations in terms of the number of subjects, execution styles, and temporal execution rates.

Our contributions in this work can be summarized as follows:

  • 1.

    Posing the problem of gesture and action recognition as one of classifying trajectories on a Riemannian shape space of closed curves.

  • 2.

    Proposing a template-based model and a Markovian graphical model for the time-series data of points on the shape manifold. These models were designed to fully adhere to the geometry of the manifold and to model the statistical variation of the data on it.

  • 3.

    Providing a comprehensive experimental analysis of the proposed models on two different datasets for gesture and action recognition.

The remainder of this paper is organized as follows: Section 2 reviews work in different areas related to this paper. In Section 3, we review some notation from Riemannian geometry and then describe the square-root representation of closed curves and the resulting shape space. We also give a brief overview of the computation of distances and statistics on this manifold. Section 4 describes the two dynamical model approaches used for gesture modeling. Experimental results validating the proposed method for human gesture recognition are presented in Section 5.

Section snippets

Related work

The problems of gesture and action recognition have received great attention in the literature. Several survey papers [3], [4], [5], [6] have tried to group and analyze the existing body of work. The reader is referred to these review papers for a complete overview of the related work in the field. Meanwhile, we will give a brief review of some of the approaches in three different areas that are of most relevance to this paper.

Manifold representation of silhouettes

As mentioned earlier, we will use the square-root elastic representation [36], [37] to construct a shape space of closed curves in R2. Under this framework, the contour of the human silhouette in each frame represents a point in the shape space and each gesture represents a temporal trajectory on that space. We then use principles from Riemannian geometry combined with the structure of the shape space to build statistical models for these trajectories for representation and recognition.

For the

Modeling gesture dynamics

Using the described shape model for closed curves, the dynamic sequence of shapes corresponding to a particular gesture or action will correspond to a sequence of m points of the form Q = {q1, q2, …, qm}, where qi ∈ S for i = 1, 2, …, m, and S is the quotient shape space described in Section 3. We propose to use the trajectories on the shape space corresponding to these sequences as features for modeling and recognition of different gestures. Because of the special nature of the curved shape space, vector

Experimental results

We carried out an extensive set of experiments to evaluate and verify the effectiveness of using the contour curve shape manifold and the two proposed methods in modeling and recognition of human actions and gestures. The experiments also investigate the effect on performance with changing some of the system choices like the cluster assignment method and whether the clusters are shared among different gesture models or not.

These experiments were performed using two different datasets

Conclusion

In this paper, we presented a novel gesture recognition technique using shape manifolds. Contours of the silhouette are extracted and represented as 2D closed elastic curves parameterized using the square-root parametrization. This representation is intrinsically invariant to translation, scale, and re-parametrization of the curve. Each gesture is modeled as a temporal trajectory on the resulting Riemannian manifold of 2D elastic curves. We proposed template and graphical-based HMM

Acknowledgment

This work was partially supported by ONR Grant No. N00014-09-1-0664.

References (68)

  • C. Bregler, Learning and recognizing human dynamics in video sequences, in: Proceedings of IEEE Computer Society...
  • A. Bobick et al.

    The recognition of human movement using temporal templates

    IEEE Trans. Pattern Anal. Machine Intell.

    (2001)
  • A. Veeraraghavan et al.

    Matching shape sequences in video with an application to human movement analysis

    IEEE Trans. Pattern Anal. Machine Intell.

    (2005)
  • L. Wang et al.

    Learning and matching of dynamic shape manifolds for human action recognition

    IEEE Trans. Image Process.

    (2007)
  • H. Freeman

    On the encoding of arbitrary geometric configurations

    IRE Trans. Electron. Comput.

    (1961)
  • C. Zahn et al.

    Fourier descriptors for plane closed curves

    IEEE Trans. Comput.

    (1972)
  • M. Hu

    Visual pattern recognition by moment invariants

    IEEE Trans. Inform. Theory

    (1962)
  • S. Belongie et al.

    Shape matching and object recognition using shape contexts

    IEEE Trans. Pattern Anal. Machine Intell.

    (2002)
  • A. Elgammal, C. Lee, Inferring 3D body pose from silhouettes using activity manifold learning, in: Proceedings of IEEE...
  • M. Black, Y. Yacoob, Parameterized modeling and recognition of activities, in: Proceedings of IEEE International...
  • O. Tuzel, F. Porikli, P. Meer, Learning on Lie groups for invariant detection and tracking, in: Proceedings of IEEE...
  • E. Begelfor, M. Werman, Affine invariance revisited, in: Proceedings of IEEE Computer Society Conference on Computer...
  • P. Turaga, A. Veeraraghavan, R. Chellappa, Statistical analysis on Stiefel and Grassmann manifolds with applications in...
  • U. Grenander et al.

    Hilbert–Schmidt lower bounds for estimators on matrix Lie groups for ATR

    IEEE Trans. Pattern Anal. Machine Intell.

    (1998)
  • A. Srivastava et al.

    Monte Carlo extrinsic estimators for manifold-valued parameters

    IEEE Trans. Signal Process.

    (2001)
  • A. Goh, R. Vidal, Clustering and dimensionality reduction on Riemannian manifolds, in: Proceedings of IEEE Computer...
  • O. Tuzel et al.

    Pedestrian detection via classification on Riemannian manifolds

    IEEE Trans. Pattern Anal. Machine Intell.

    (2008)
  • P. Fletcher, S. Venkatasubramanian, S. Joshi, Robust statistics on Riemannian manifolds via the geometric median, in:...
  • A. Veeraraghavan, A. Roy-Chowdhury, R. Chellappa, Role of shape and kinematics in human movement analysis, in:...
  • L. Rabiner

    A tutorial on hidden Markov models and selected applications in speech recognition

    Proc. IEEE

    (1989)
  • J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, in:...
  • M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, in: Proceedings of IEEE...
  • S. Hongeng, R. Nevatia, Large-scale event detection using semi-hidden Markov models, in: Proceedings of IEEE...
  • S. Wang, A. Quattoni, L. Morency, D. Demirdjian, T. Darrell, Hidden conditional random fields for gesture recognition,...