Variable silhouette energy image representations for recognizing human actions

https://doi.org/10.1016/j.imavis.2009.09.018

Abstract

Recognizing human actions is an important topic in the computer vision community. One of the challenges of recognizing human actions is accounting for the variability that arises when a camera at an arbitrary view captures a person performing actions. In this paper, we propose a spatio-temporal silhouette representation, called the silhouette energy image (SEI), and multiple variability action models, to characterize the motion and shape properties needed for automatic recognition of human actions in daily life. To address the variability in the recognition of human actions, several parameters are proposed, such as anthropometry of the person, speed of the action, phase (starting and ending state of an action), camera observations (distance from camera, slanting motion, and rotation of the human body), and view variations. We construct the variability (or adaptable) models based on the SEI and the proposed parameters. Global motion descriptors express the spatio-temporal properties of the combined energy templates (the SEI and the variability action models). Our construction of the optimal model for each action and view is based on the support vectors of the global motion descriptions of the action models. We successfully recognize different daily human actions of different styles in indoor and outdoor environments. Our experimental results show that the proposed method of human action recognition is robust, flexible, and efficient.

Introduction

Recognition of human actions from multiple views by the classification of image sequences has applications in video surveillance and monitoring, human–computer interaction, model-based compression, and video retrieval in various situations. Typical situations include scenes with moving or cluttered backgrounds, stationary or non-stationary cameras, scale variation, variation in the starting and ending states of actions, individual variations in the appearance and clothing of people, changes in lighting and view-point, and so on. These situations make human action recognition a challenging task. Several human action recognition methods have been proposed in the last few decades. Detailed surveys can be found in several papers including [1].

The approach we consider for recognizing actions is to extract a set of features from each frame of an image sequence and use these features to train classifiers and perform recognition. It is therefore important to consider the appropriateness and robustness of features for action recognition in varying environments. In fact, no rigid syntax or well-defined structure is available for human action recognition. Moreover, there are several sources of variability that can affect human action recognition, such as variation in speed, view-point, size and shape of the performer, phase change of the action, scaling of persons, and so on. In addition, the motion of the human body is non-rigid in nature. These characteristics make human action recognition a sophisticated task. Considering the above circumstances, we consider some issues that affect the development of action models and classification, which are as follows: (i) an action can be characterized by the local motion of human body parts; (ii) an action can be illustrated by the silhouette image sequence of the human body, which can be regarded as global motion flow; (iii) the trajectory of an action seen from different viewing directions is different, and some body parts (part of a hand, the lower part of a leg, part of the torso, etc.) are occluded due to view changes; and (iv) human actions depend on several sources of variability, such as anthropometry, the manner of performing the action, speed, phase variation (starting and ending time of the action), and camera view variations such as zooming, tilting, and rotating.

Among various features, the motion of the body parts and the human body shape play the most significant role in recognition. Motion-based features can approximate the moving direction of the human body, and human action can be characterized more effectively by motion than by other cues, such as color and depth. In the motion-based approach, motion information such as optical flow, affine variations, filters, gradients, spatio-temporal words, and motion blobs is used for recognizing actions. Motion-based action recognition has been performed by several researchers; a few examples are [4], [12], [13], [15]. However, motion-based techniques are not always robust in capturing velocity when the motions of different actions are similar for the same body parts. On the other hand, the human body silhouette represents the pose of the human body at any instant in time, and a series of body silhouette images can be used to recognize human actions correctly, regardless of the speed of movement. Different descriptors of the shape information of motion regions, such as points, boxes, silhouettes, and blobs, are used for recognizing or classifying actions. Several researchers have performed action recognition using shapes or silhouettes, such as [2], [3]. Bobick and Davis [2] proposed the motion energy image (MEI) and the motion history image (MHI) for human movement representation and recognition, which were constructed from cumulative binary motion images. Han [20] proposed the gait energy image for individual recognition. Another gait representation, called the motion silhouette image (MSI), was proposed by Lam in [21]. We propose the silhouette energy image (SEI), built from the silhouette image sequence, for human action recognition.

Besides motion and body shape, several frequently occurring sources of variability also affect human action recognition. Sheikh and Shah [16] explicitly identified three sources of variability in action recognition, namely view-point, execution rate, and anthropometry of actors, and they used 3D space with thirteen anatomical landmarks for each image. Related works have typically concentrated on variability in view-point [14] by deriving view-invariant features or proposing a view-invariant algorithm.

During action recognition, we utilize 2D information of global shape motion features, in addition to several sources of variability, for recognizing periodic as well as non-periodic (single-occurrence) actions. The global shape motions are extracted from the geometric shape of the models. Based on the combined information of global motion, sources of variability, and multiple views, human action recognition becomes more robust and flexible. We propose to recognize several actions of daily human life from multiple views by learning global motion features with a multi-class support vector machine (MCSVM).

Our work is motivated by the ability of humans to utilize periodic and non-periodic motions to perform several actions that are frequently used in daily life. It is well recognized that many actions are periodic in nature. This periodic nature of human actions can be analyzed using the shape of human beings, since body parts, as well as shapes, change while performing particular actions. Shape analysis plays an important role in action recognition, gait recognition, etc. In many situations, we are interested in the movement of the human body silhouette (shape) over time. The changing shape of a human describes the nature of the motion and reveals the action or activity being performed. This change of shape over time is considered the result of global motion of the shape and its deformations. We capture this global motion change with a compact representation in which we accumulate all temporal information into a single static image, i.e. 2D information. This static representation provides an important cue for global and local motion characteristics, such as motion distribution, motion orientation, and shape deformation. By using appropriate variability parameters, we capture more relational characteristics of each action.

We define a human action as the movement of a human for performing a task within a short period of time. In this paper, we represent a human action by a silhouette energy image (SEI), which is constructed from a sequence of silhouette images. Similar prototype actions, called variability templates (VTs) or models, are generated from the SEI and the variability parameters. The variability parameters are: (1) anthropometry, (2) variation of phase (varying starting and ending states of actions), (3) speed variation of an action, and (4) camera observations (zooming of the person, slanting motion, and body rotation). During the construction of the SEI, we utilize the period or duration of an action, which is the difference between the starting and ending states of the action. The period of an action may be represented by half a cycle, one cycle, or several cycles. The period might be half a cycle when the remaining half cycle predictably follows the same pattern as the previous half cycle. Strictly speaking, these representations of an action are not the same, but we can consider them approximately similar. We also consider multiple view variations (action recognition from multiple viewing directions). Moreover, a person's clothing, occlusion, etc. affect recognition. All the above factors are closely related to human action representation and recognition.

An SEI is constructed from the sequence of silhouette images. Therefore, in the case of an SEI, the energy of a pixel can be considered a local motion descriptor, and an average of a set of energies can present a more informative characteristic of the local image content. In contrast, a global descriptor can be defined as a single number (or value) that characterizes the whole image content. A 'semi-global descriptor' refers to the use of many values instead of a single one, such as a set of values in a multiscale window centered on the point of interest. Multiple values can therefore characterize an image with more information for comparing image sets. We use global descriptors of the SEI and the corresponding VTs for characterizing and classifying human actions.
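The per-pixel accumulation behind an energy image of this kind can be sketched as a temporal average of binary silhouette frames. This is a minimal illustration, not necessarily the paper's exact formulation; the function and variable names are hypothetical:

```python
import numpy as np

def silhouette_energy_image(silhouettes):
    """Average a sequence of binary silhouette frames into one
    grayscale energy image (illustrative sketch; the paper's exact
    normalization over the action period may differ)."""
    stack = np.stack([s.astype(float) for s in silhouettes])
    return stack.mean(axis=0)  # per-pixel energy in [0, 1]

# Toy example: three 4x4 binary silhouettes of a blob moving right.
frames = [np.zeros((4, 4), dtype=int) for _ in range(3)]
frames[0][1, 1] = 1
frames[1][1, 1] = frames[1][1, 2] = 1
frames[2][1, 2] = 1
sei = silhouette_energy_image(frames)
```

Pixels visited by the silhouette in more frames receive higher energy, which is what makes each pixel's value a local motion descriptor.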

In this paper, we use image similarity to recognize human actions. The major contributions are as follows:

  1. We use a 2D representation of the human action model, called the SEI, which accumulates time-varying silhouette images into a single image for action recognition. Therefore, action representation using the SEI saves both storage space and time.

  2. We introduce explicit variability action models, which account for different forms of the same action, for human action recognition. Four important factors are considered: anthropometry of persons, speed of an action, the starting and ending phase of an action, and camera observations (zoom, scale, and rotation). Moreover, multiple view variations are accommodated, which makes human action recognition more robust.

  3. Typical scenarios with a homogeneous stationary camera, scale variations, appearance and clothing variations, multiple views, and incomplete actions are recognized.

Of particular interest is the detection method, which we use for the recognition of several daily actions of elderly people for human–robot interaction (HRI) or similar applications.

In our system, we assume that the silhouettes of an image sequence are correctly captured. From the silhouette image sequence, we estimate the temporal boundary (i.e. period or duration) of each action [19]. Based on this temporal boundary, an action model (i.e. an SEI) is constructed from the silhouette image sequence. Using the variability parameters and the SEI, VTs are generated. The models are characterized by global motions. We learn each action from multiple-view global motion descriptors using an MCSVM, generating SVM models for the specified actions. To recognize actions, we classify descriptors (using feature similarity) against the SVM models. The action modeling and classification in this work involve both the Korea University full body gesture database (FBGDB) [5] and the KTH database (KTHDB) [11]. Our proposed action recognition approach is shown in Fig. 1.

This paper is organized as follows: Section 2 presents the action representation of our approach. Section 3 describes the generation of variability models in our system. Section 4 discusses global motion descriptors of the combined models. Section 5 presents the classification approach for human actions. Section 6 presents experimental results and discussion of the selected approaches. Finally, conclusions are drawn in Section 7.

Section snippets

Silhouette energy image (SEI) representation

A human action is the movement of a human for performing a task within a short period of time. The action may be simple or complex depending on the number of body limbs involved. Many actions performed by humans are cyclic in nature and show periodicity of short duration. Besides, many actions are single-occurrence or non-periodic, with a time frame of specific length (i.e. duration). We have considered daily performed human actions which are almost cyclic in nature, either multiple

What are variability action models?

The variability action models or variability templates (VTs) are defined as noise action models or complementary action models that are generated using the SEI and the variability parameters. If the representations of an action derived under different variability or adaptability parameters (anthropometry, execution rate, phase, camera observation) are similar, then the representation is said to be robust to these parameters. The original action model is not a unique representation
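As an illustration of how variability templates might be generated from a base SEI, the sketch below applies two hypothetical transforms, one spatial (anthropometry/scale) and one temporal (execution speed). The paper's actual transforms are not given in this excerpt, and all names here are illustrative:

```python
import numpy as np

def scale_template(sei, factor):
    """Anthropometry/scale variability: nearest-neighbor rescaling of
    an energy template (illustrative transform, names hypothetical)."""
    h, w = sei.shape
    rows = (np.arange(int(h * factor)) / factor).astype(int).clip(0, h - 1)
    cols = (np.arange(int(w * factor)) / factor).astype(int).clip(0, w - 1)
    return sei[np.ix_(rows, cols)]

def speed_template(frames, step):
    """Speed variability: rebuild the energy image from every
    `step`-th silhouette frame, mimicking a faster execution rate."""
    sub = frames[::step]
    return np.mean(np.stack(sub).astype(float), axis=0)

base = np.eye(4)                       # toy 4x4 energy template
vt_scaled = scale_template(base, 2.0)  # 8x8 variability template

frames = [np.full((2, 2), float(i)) for i in range(4)]
vt_fast = speed_template(frames, 2)    # averages frames 0 and 2
```

Each transform yields a complementary template of the same action, which is the role the VTs play in the recognition pipeline.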

Global shape motion descriptions

We define the combined variability templates or models (CVTs) as the combination of action models (SEIs or ATs) and variability models (VTs). We describe the geometric shape motions by {sg, sz, vx, vy, vh, skp, so}. The notations are defined in the following subsections.
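The subsections defining {sg, sz, vx, vy, vh, skp, so} are not included in this excerpt. As a rough illustration of the kind of global shape descriptors such a vector might contain, the sketch below computes an energy-weighted centroid, a size, and an orientation from image moments of an energy template; the function and feature names are hypothetical and do not claim to match the paper's definitions:

```python
import numpy as np

def global_shape_features(sei):
    """Illustrative global shape descriptors of an energy template:
    energy-weighted centroid (cx, cy), total energy (size), and
    orientation from second-order central moments."""
    ys, xs = np.nonzero(sei > 0)
    w = sei[ys, xs]
    cy = np.average(ys, weights=w)
    cx = np.average(xs, weights=w)
    size = w.sum()
    mu20 = np.average((xs - cx) ** 2, weights=w)
    mu02 = np.average((ys - cy) ** 2, weights=w)
    mu11 = np.average((xs - cx) * (ys - cy), weights=w)
    orientation = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return np.array([cx, cy, size, orientation])

feats = global_shape_features(np.ones((3, 3)))
```

Descriptors of this kind condense a whole template into a few values, which is what allows templates to be compared and fed to a classifier.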

Classification of actions

The classification of human actions can be carried out by different processes applied to the feature vectors, such as the Bayes classifier, the k-nearest neighbor (kNN) classifier, and the support vector machine (SVM) classifier. Among them, the SVM has high generalization capability in many tasks, especially object recognition. The SVM is based on the idea of a hyperplane classifier that achieves classification by a separating surface (linear or nonlinear) in the input space of the data set. SVMs are
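The paper classifies the global motion descriptors with an MCSVM. As a hedged sketch, with synthetic descriptors and scikit-learn standing in for whatever implementation the authors used, a one-vs-one multi-class SVM looks like this:

```python
import numpy as np
from sklearn.svm import SVC

# Toy descriptors: 2D feature vectors for three synthetic "actions",
# drawn as tight clusters around well-separated centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)

# One-vs-one decomposition is one common way to build an MCSVM.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)
pred = clf.predict([[0.05, 0.02], [2.9, 0.1], [0.1, 3.1]])
```

In the paper's setting, each row of X would be a global motion descriptor of a CVT and each label an action class for a given view.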

FBGDB

The FBGDB [5] contains 14 representative full-body actions of daily life performed by 20 performers. In the database, all the performers are elderly persons (both male and female) with ages ranging from 60 to 80. The database consists of 2D video data and silhouette data taken from three views: the front view (v1), the left-side or −45° view (v2), and the right-side or +45° view (v3). Sample images are shown in Fig. 8, where the symbols represent the action names.

KTHDB

The KTHDB is one of the largest databases with

Conclusions and further research

This paper proposed a novel method for human action recognition using the SEI with variability action models. The variability models provided a more natural and robust environment for human action recognition through an advanced human–machine interface, owing to consideration of the following invariance factors: the shape of actors, the starting and ending state of an action, the speed of an action, camera observations, and different scenarios. From the combined information of the SEI and the VTs, the global motion

Acknowledgments

This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. 2009-0060113). This research was also supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy of Korea. This work was also supported in part by CASR grants KUET, Khulna, Bangladesh.

References (21)

  • M. Ahmad et al.

    Human action recognition using shape and CLG-motion flow from multi-view image sequences

    Pattern Recognition

    (2008)
  • P. Turaga et al.

    Machine recognition of human activities: a survey

IEEE Transactions on Circuits and Systems for Video Technology

    (2008)
  • A.F. Bobick et al.

    The recognition of human movement using temporal templates

IEEE Transactions on PAMI

    (2001)
  • S. Carlsson, J. Sullivan, Action recognition by shape matching to key frames, in: IEEE Workshop on Models vs. Exemplars...
  • P. Dollár, G.C.V. Rabaud, S. Belongie, Behavior recognition via sparse spatio-temporal filters, in: IEEE Workshop...
  • FBGDB....
  • C.-W. Hsu et al.

    A comparison of methods for multiclass support vector machines

IEEE Transactions on Neural Networks

    (2002)
  • M.-K. Hu

    Visual pattern recognition by moment invariants

IRE Transactions on Information Theory

    (1962)
  • H. Jiang et al.

    Successive convex matching for action detection

    CVPR

    (2006)
  • Y. Ke et al.

    Efficient visual event detection using volumetric features

    ICCV

    (2005)


A preliminary version of the paper has been presented in the IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, September 2008.
