Elsevier

Neurocomputing

Volume 71, Issues 16–18, October 2008, Pages 3561-3574
Activity recognition through multi-scale motion detail analysis

https://doi.org/10.1016/j.neucom.2007.09.012

Abstract

Activity recognition is one of the most challenging problems in video content analysis and high-level computer vision. This paper proposes a novel activity recognition approach in which we decompose an activity into multiple interactive stochastic processes, each corresponding to one scale of motion details. To model the interactive processes, we present a hierarchical durational-state dynamic Bayesian network (HDS-DBN) that models two stochastic processes related to two scales appropriate for intelligent surveillance. In the HDS-DBN, states are decomposed in terms of multi-scale motion details, and each kind of state has a clear meaning. The effectiveness of this approach is demonstrated by experiments on individual activity recognition and two-person interacting activity recognition.

Introduction

In the past decade, there has been rapidly growing interest in recognizing human activities. This interest is mainly driven by several potential applications of human activity recognition, such as intelligent surveillance, video content analysis and human–computer interaction.

Generally speaking, activity recognition includes individual activity recognition [3], [7], [12], two-person interaction recognition [21], [31] and group interaction recognition [16]. The task of human activity recognition is to analyze human activities and produce a high-level description of them. Activity recognition can be viewed simply as a classification problem over time-varying feature data. Recognition then consists of matching an unknown test sequence against a library of labeled sequences that represent the prototypical activities [15]. The feature data to be classified are obtained through motion detection [37] and tracking [10], which transform the pixel-level data into the features needed for activity recognition.
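As a minimal sketch of the matching step just described, the snippet below labels a test sequence by its nearest neighbor in a small library of labeled sequences. The distance used here is dynamic time warping (DTW), which is an assumption chosen for illustration and not the method of this paper; the function names, the toy labels and the one-dimensional feature values are likewise hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: lists of equal-dimension feature tuples, one tuple per frame.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame of b
                                 cost[i][j - 1],      # repeat frame of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify(test_seq, library):
    """Return the label of the library sequence closest to test_seq.

    library: list of (label, sequence) pairs (hypothetical toy format).
    """
    return min(library, key=lambda item: dtw_distance(test_seq, item[1]))[0]


# Made-up 1-D features: a steadily increasing sequence vs. an oscillating one.
library = [
    ("walk", [(0.0,), (1.0,), (2.0,), (3.0,)]),
    ("wave", [(0.0,), (1.0,), (0.0,), (1.0,)]),
]
test_seq = [(0.0,), (0.5,), (1.0,), (2.0,), (3.0,)]
print(classify(test_seq, library))
```

Because DTW tolerates differences in sequence length and speed, the five-frame test sequence can be compared directly with the four-frame prototypes. Probabilistic models such as the DBNs discussed in this paper replace the distance with a likelihood under each class model.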

Aggarwal and Park [1] observed that activity recognition for different tasks can be achieved at different levels of detail: the gross, intermediate and detailed levels. At the gross level, a human can be modeled as a point, a bounding box or an ellipse [12], [19], [22], [28], [31]. Hongeng [19] modeled a human as a bounding box and analyzed features related to the trajectories. At the intermediate level, a person is represented by his/her major body parts [3], [24]. Luo et al. [24] extracted the positions of body parts to represent humans, and used a dynamic Bayesian network to recognize sport events. At the detailed level, activities of a single body part are recognized, mainly with the aim of developing gesture-based human–computer interfaces (HCI) [38].

Generally, different levels of motion details lie on different scales. Past research has mainly focused on activity recognition using one scale of motion details for one task, and only a few papers use several scales of motion details to recognize simple human motions [13]. How to use multiple scales of motion details for recognizing human activities has rarely been explored. In fact, an activity contains multiple scales of motion details [35]. Fig. 1 gives a possible definition of scale in human activity recognition. In this figure, motion details related to trajectories are on the large scale and mainly reflect coarse motion properties, such as relations between the person and the scene or between persons. Motion details related to human silhouettes are on the intermediate scale and reflect human poses such as squatting. Motion details related to human body parts are on the small scale and indicate finer movements such as taking an object by hand. Motion details on different scales indicate different properties of human activities, and play different roles in recognizing activities. If we treat them uniformly, the fine but important details on the small scale are sometimes submerged. The choice of scales depends on both the task and the resolution of the frames. In intelligent surveillance applications, the three scales defined in Fig. 1 are generally available.

Across all the human activities in the world, we assume that the motion details on one scale are independent of those on other scales. Given a certain class of activities, on the other hand, the motion details on different scales are dependent. On each scale, motion details form a stochastic process over time, and the stochastic processes corresponding to different scales in an activity influence each other. Intuitively, the influence between two neighboring scales is more direct and stronger than that between non-neighboring scales. Hence we may keep only the dependence between the stochastic processes corresponding to two neighboring scales, which reduces the complexity of activity modeling. Modeling an activity can then be achieved by modeling such interactive stochastic processes.
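One way to write this simplification down, offered as a sketch under stated assumptions rather than as the formal model of this paper: let $X^{(s)}$ denote the stochastic process on scale $s$, with $s = 1$ the largest scale and $s = S$ the smallest, and let $c$ denote the activity class. These symbols, and the coarse-to-fine direction of conditioning, are introduced here for illustration only. Keeping only the dependence between neighboring scales then corresponds to a chain-structured factorization:

```latex
P\bigl(X^{(1)}, \dots, X^{(S)} \mid c\bigr)
  = P\bigl(X^{(1)} \mid c\bigr) \prod_{s=2}^{S} P\bigl(X^{(s)} \mid X^{(s-1)}, c\bigr)
```

Each factor involves at most two processes, whereas the unrestricted chain rule would condition $X^{(s)}$ on all of $X^{(1)}, \dots, X^{(s-1)}$. For $S = 2$ the expression reduces to $P(X^{(1)} \mid c)\, P(X^{(2)} \mid X^{(1)}, c)$, which is exact.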

In our work, we explore human activity recognition in the framework of dynamic Bayesian networks (DBNs), by which the multiple processes reflecting different scales of motion details can be modeled. The contributions of this paper are:

  • (1) To propose a novel approach to human activity recognition based on analyzing the multi-scale details of human activities. We consider that an activity can be decomposed into multiple interactive stochastic processes, each corresponding to one scale of motion details. An activity is modeled by modeling these interactive processes.

  • (2) To present a new DBN structure, named the hierarchical durational-state dynamic Bayesian network (HDS-DBN), to model and recognize human activities by combining two appropriate scales shown in Fig. 1: motion details related to trajectories and motion details related to human silhouettes. The HDS-DBN has two levels of state: the global activity state and the local activity state. The two levels of state model the two stochastic processes corresponding to the two scales.

The remainder of the paper is organized as follows. In Section 2 we review methods of activity recognition in the framework of DBNs. In Section 3 we introduce the HDS-DBN model and activity recognition based on this model. Feature extraction is described in Section 4. In Section 5 we give the experimental results. Finally, Section 6 summarizes the proposed approach.

Related work

During the last decade, many techniques have been applied to activity recognition, such as template matching [3], [7], DBNs [11], [16], [22], [24], [40] and syntactic techniques [20]. Several reviews of these techniques can be found in [15], [25], [39]. A large body of past research has focused on activity recognition in the framework of DBNs.

As a kind of DBN, hidden Markov models (HMMs) were adopted [40] and have become the most popular approach. Given the

Hierarchical durational-state DBN and activity recognition

In this section, we introduce the modeling and recognition of human activities using HDS-DBNs. In this paper, we select two of the scales shown in Fig. 1: the large scale consists of motion details related to trajectories, and the small one consists of motion details related to human silhouettes. Hence an activity is decomposed into two stochastic processes that correspond to these two scales, respectively. In the HDS-DBN, states are decomposed into two levels of state. One level of state, named global activity state,

Feature extraction for activity representation

We extract two kinds of features, corresponding respectively to the two scales analyzed in Section 3. One kind, named the global feature, corresponds to the large scale and is extracted from human motion trajectories. The other, named the local feature, corresponds to the small scale and is extracted from human silhouettes.
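To make the distinction between the two kinds of features concrete, the following sketch computes one plausible set of trajectory-based features and one plausible set of silhouette-based features. The specific features (speed and heading; bounding-box aspect ratio and fill ratio), the function names, the `fps` parameter and the toy inputs are all assumptions for illustration. The text available here does not specify which features the paper actually uses.

```python
import math


def global_features(trajectory, fps=25.0):
    """Per-step (speed, heading) from a list of (x, y) centroid positions.

    Hypothetical trajectory-based features; fps is an assumed frame rate.
    """
    feats = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) * fps   # pixels per second
        heading = math.atan2(dy, dx)       # radians in (-pi, pi]
        feats.append((speed, heading))
    return feats


def local_features(mask):
    """(aspect ratio, fill ratio) of a binary silhouette mask.

    mask: list of rows, each a list of 0/1 values (hypothetical format).
    Returns None when the mask contains no foreground pixel.
    """
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    if not pts:
        return None
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    aspect = height / width               # tall pose > 1, crouched pose lower
    fill = len(pts) / (height * width)    # share of the bounding box covered
    return (aspect, fill)


# Made-up inputs: a short track and a small upright silhouette.
track = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
silhouette = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
print(global_features(track, fps=1.0))
print(local_features(silhouette))
```

In this sketch the global features change only when the person moves through the scene, while the local features change with pose even when the person stays in place, which is the kind of complementary information the two scales are meant to capture.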

In our work, we do not explore how to extract features for representing human motions; hence some of the features we adopt here have been used in other papers. Features are

Experimental results and analysis

Experiments are conducted on recognizing individual activities and two-person interacting activities. We use two datasets: one contains individual activities, and the other contains two-person interacting activities. The activities in the datasets are acted out by different people and captured by a single static camera.

Conclusion

This paper presents a novel approach to human activity recognition. In this work, we consider that an activity contains multiple scales of motion details. Hence an activity can be decomposed into multiple interactive stochastic processes, which are associated with different scales of motion details. Based on the properties of these stochastic processes, we present the HDS-DBN to model an activity on two scales, which are appropriate in the case of intelligent surveillance. The proposed

Acknowledgments

This work was supported by United Technologies Research Center (UTRC). The authors would like to thank the video team at UTRC for their pertinent and constructive discussions. The authors would also like to thank Dr. K.P. Murphy for his MATLAB Bnet toolbox, and the anonymous reviewers for their comments.

Youtian Du received the B.S. degree from the department of electrical engineering at Xi’an JiaoTong University, China, in 2002. He is currently a Ph.D. candidate in the department of automation at Tsinghua University, China. His research interests include computer vision, video content analysis and pattern recognition.

References (40)

  • Y. Bengio, Markovian models for sequential data, Neural Comput. Surv. (1999)
  • J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation of Gaussian mixture and...
  • J.A. Bilmes, Dynamic Bayesian multinets, in: Uncertainty in Artificial Intelligence, 2000, pp....
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Trans. PAMI (2001)
  • M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, CVPR, San Juan, Puerto...
  • H.H. Bui et al., Policy recognition in the abstract hidden Markov model, J. Artif. Intell. Res. (2002)
  • Y. Du, F. Chen, W. Xu, Y. Li, Recognizing interaction activities using dynamic Bayesian network, ICPR, Hong Kong,...
  • T.V. Duong, H.H. Bui, D.Q. Phung, S. Venkatesh, Activity recognition and abnormality detection with the switching...
  • C. Fanti, L.Z. Manor, P. Perona, Hybrid models for human motion recognition, CVPR, San Diego, USA, 2005, pp....
  • S. Fine et al., The hierarchical hidden Markov model: analysis and applications, Mach. Learning (1998)
Feng Chen received the B.S. and M.S. degrees in automation from Saint-Petersburg Polytechnic University, Russia, in 1994 and 1996, respectively, and the Ph.D. degree from the automation department of Tsinghua University, Beijing, China, in 2000. He is currently an associate professor at Tsinghua University. His research interests are mainly in the areas of computer vision and video processing.

Wenli Xu received the B.S. degree in electrical engineering and the M.S. degree in automatic control engineering from Tsinghua University, Beijing, China, in 1970 and 1980, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Colorado at Boulder, CO, in 1990. He is currently a professor at Tsinghua University and a director of the Chinese Association of Automation. His research interests are mainly in the areas of video processing, computer vision, robotics and automatic control.

Weidong Zhang received the B.S. degree from the department of electrical engineering at Xi’an JiaoTong University, China, in 2003. He is currently a Ph.D. candidate in the department of automation at Tsinghua University, China. His research interests include computer vision, video processing and intelligent surveillance.