Elsevier

Pattern Recognition

Volume 42, Issue 7, July 2009, Pages 1559-1571

Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking

https://doi.org/10.1016/j.patcog.2008.12.024

Abstract

We present a method to reconstruct human motion pose from uncalibrated monocular video sequences based on morphing appearance model matching. Human pose estimation is performed by integrating human joint tracking with pose reconstruction in depth-first order. First, the Euler angles of each joint are estimated by inverse kinematics under human skeleton constraints. Then, the scene coordinates of the pixels belonging to each body segment are determined by forward kinematics, and these pixels are projected onto the image plane under the assumption of perspective projection to obtain the region of the morphing appearance model in the image. Finally, the human motion pose is reconstructed by histogram matching. Experimental results show that the method obtains favorable reconstruction results on a number of complex human motion sequences.

Introduction

Human motion contains a wealth of information about the actions, intentions, emotions, and personality traits of a person, and it plays an important role in many application areas such as surveillance, human motion analysis, and virtual reality. Tracking human joints and reconstructing the corresponding 3D human motion posture from an uncalibrated monocular video sequence is an active research topic. Human motion pose reconstruction can be categorized into two groups: (1) methods using multi-view video sequences, and (2) methods using monocular video sequences. Reconstruction from monocular video sequences is more attractive because it is convenient to use, readily available to the general public, and imposes fewer restrictions. However, the depth of an object is lost when it is projected onto the 2D image plane, so 3D motion reconstruction from 2D image sequences remains a challenging task. Conventional methods for reconstructing human pose from monocular video sequences may require restrictive assumptions or prior knowledge. In contrast to these classical algorithms, in this paper we propose an approach to reconstruct 3D human motion pose from uncalibrated monocular video sequences by combining human joint tracking with pose extraction. Its advantages include fewer constraints, no need for the parameters of a camera model, ease of implementation, and more precise pose reconstruction.

Human pose reconstruction from monocular video sequences is roughly divided into two categories: machine learning methods and object tracking methods. Machine learning methods exploit prior knowledge to obtain more stable estimates of 3D human body pose [1], [2], [3], [4], [5]; however, they require a large number of training samples, which limits their applicability. Object tracking methods commonly follow two sequential steps: first, locating features of the human body and tracking them in each frame; second, reconstructing the human pose from the obtained features. Many researchers have studied the first step, and general surveys can be found in recent review papers [6], [7]. In this step, the configuration in the current frame and a dynamic model are typically used to predict the next configuration [8], [9]. Most approaches perform prediction with variants of Kalman filtering [9], [10] or particle filtering [11], [12], [13]. Particle filters restrict themselves to predictions returned by a motion model, which is hard to construct; such a scheme is susceptible to drift when an imprecise motion model yields poor predictions. Annealing the particle filter [14] or performing local searches [15] are ways to address this difficulty. The second step is human pose reconstruction, i.e., recovering the 3D coordinates of each feature from its 2D image coordinates. Some researchers reconstruct human motion pose from video sequences using constraints such as human skeleton proportions together with a camera model. These methods can be classified into two classes depending on the camera model adopted: (1) affine camera models; and (2) perspective camera models. The affine camera model is only an approximation of the real camera. The scaled-orthographic camera model is an important instance of this kind and is popularly used by many researchers [16], [17].
The scale factor s has a significant effect on the result of human motion pose reconstruction under the scaled-orthographic camera model [16]. In these methods, the scale factor s is estimated by satisfying a constraint formula rather than from a ground-truth value, so the reconstructed human pose can differ greatly from the real pose, and such methods can only handle images with very little perspective effect. In addition, relatively few research efforts address human pose reconstruction based on the perspective camera model [18], [19], [20]. Zhao et al. [20] restrict all body segments of the human figure to be almost parallel to the image plane in order to acquire accurate human skeleton proportions, and Peng [19] requires estimating a virtual scale factor for each frame.
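To make the contrast between the two camera models concrete, the following sketch compares a pinhole (perspective) projection against a scaled-orthographic one, where a single global scale factor s replaces per-point depth. The function names and numeric values are illustrative, not taken from the paper; the point is that perspective projection shrinks farther points while the scaled-orthographic approximation treats all depths alike, which is why it only suits scenes with little perspective effect.

```python
import numpy as np

def perspective_project(points, f=1.0):
    """Pinhole projection of 3D points (N, 3): x = f*X/Z, y = f*Y/Z."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]

def scaled_orthographic_project(points, s):
    """Scaled-orthographic approximation: depth is replaced by one
    global scale factor s, so x = s*X, y = s*Y for every point."""
    points = np.asarray(points, dtype=float)
    return s * points[:, :2]

# Two joints at the same (X, Y) but different depths.
joints = np.array([[0.5, 0.5, 2.0],
                   [0.5, 0.5, 4.0]])
persp = perspective_project(joints, f=1.0)
ortho = scaled_orthographic_project(joints, s=1.0 / 3.0)  # s ~ f / mean depth
```

Under perspective projection the two joints land at different image positions; under the scaled-orthographic model they coincide, so any depth variation within the body is lost.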

The remainder of this paper describes our algorithm in more detail. Section 2 explains the data flow in our system, and Section 3 describes its initialization. The detailed procedure for reconstructing human motion pose is given in Section 4, and Section 5 presents results from our system. Finally, we conclude the study and point out directions for future work.

Section snippets

Overview

The basic idea of our algorithm is to reconstruct the 3D human pose from the corresponding 2D joints on the image plane. The position of each human joint in every frame of the video is located by a local search using morphing appearance model matching.
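The matching score used in such a local search can be based on histogram comparison, as the abstract describes. A minimal sketch, using normalized histogram intersection as one plausible similarity measure (the paper's exact metric is not specified here):

```python
import numpy as np

def hist_similarity(patch_a, patch_b, bins=16):
    """Score two grayscale patches by normalized histogram intersection.
    Returns a value in [0, 1]; 1.0 means identical intensity distributions."""
    ha, _ = np.histogram(patch_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(patch_b, bins=bins, range=(0, 256))
    ha = ha / max(ha.sum(), 1)
    hb = hb / max(hb.sum(), 1)
    return np.minimum(ha, hb).sum()

rng = np.random.default_rng(0)
template = rng.integers(0, 256, size=(20, 20))
same = hist_similarity(template, template)                      # identical patches
other = hist_similarity(template, rng.integers(0, 256, size=(20, 20)))
```

In a local search, this score would be evaluated at each candidate joint position and the maximizer taken as the tracked location.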

The proposed algorithm is divided into four major steps as shown in Fig. 1. The first step is to initialize models by a simple user interface with the first frame as input, the texture information and space

Human skeleton model

We represent the human body as a tree stick model, inspired by the human body model employed at the Human Modeling and Simulation Center at the University of Pennsylvania [21]. As shown in Fig. 3, the human skeleton model consists of rigid parts connected by joints, in which J1 is the root joint, corresponding to the pelvis. Information about the other joints is provided in Table 1. Fig. 4 shows the tree structure of the human skeleton model. The relative lengths of human body segments in the model are
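A tree-structured stick skeleton like the one described can be sketched as joints that each store an offset from their parent, with absolute positions accumulated up to the root. The joint names and offsets below are illustrative placeholders, not the paper's actual Table 1 values:

```python
class Joint:
    """One node of a tree stick skeleton rooted at the pelvis."""
    def __init__(self, name, offset, parent=None):
        self.name = name
        self.offset = offset          # 3D offset from the parent joint
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def position(self):
        """Absolute position: accumulate offsets up the chain to the root."""
        if self.parent is None:
            return list(self.offset)
        p = self.parent.position()
        return [p[i] + self.offset[i] for i in range(3)]

# A tiny fragment of the tree rooted at the pelvis (J1 in the paper).
pelvis = Joint("J1_pelvis", [0.0, 0.0, 0.0])
hip_r  = Joint("hip_right", [0.1, 0.0, 0.0], parent=pelvis)
knee_r = Joint("knee_right", [0.0, -0.45, 0.0], parent=hip_r)
```

Traversing such a tree in depth-first order, as the paper's pipeline does, visits each limb chain from the root outward.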

Initialization

The problem can be decomposed into three sub-problems: estimating the relative lengths of body segments in the human skeleton model, initializing the appearance model of the person, and estimating the scale factor s corresponding to the root joint.

We have developed a graphical user interface that allows the user to select the projections of the subject's joints in the first frame. A marked image is shown in Fig. 9, in which the green dots depict all selected joints while the

Human motion tracking by template matching

Human motion tracking is performed by a local search in the image using template matching. The whole tracking is decomposed into three steps: (1) estimating the rotation Euler angles from the coordinates of the joint candidate, (2) estimating the scene coordinates of the appearance-model pixels after rotation by forward kinematics based on the estimated Euler angles, and (3) estimating the region of the morphing appearance model by projecting these pixels onto the image plane, then
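Steps (2) and (3) above can be sketched as follows: rotate a limb's local offset by a rotation built from Euler angles, place the child joint by forward kinematics, and project the resulting scene point onto the image plane. The Z-Y-X angle convention and the numbers here are assumptions for illustration, not necessarily the paper's:

```python
import numpy as np

def euler_zyx(rz, ry, rx):
    """Rotation matrix from Z-Y-X Euler angles (radians)."""
    cz, sz = np.cos(rz), np.sin(rz)
    cy, sy = np.cos(ry), np.sin(ry)
    cx, sx = np.cos(rx), np.sin(rx)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def forward_kinematics(parent_pos, R, local_offset):
    """Place a child joint: rotate its local offset, add the parent position."""
    return parent_pos + R @ local_offset

def project(point3d, f=1.0):
    """Perspective projection of a scene point onto the image plane."""
    return f * point3d[:2] / point3d[2]

# Example: a unit-length limb rotated 90 degrees about Z, then projected.
parent = np.array([0.0, 0.0, 3.0])
R = euler_zyx(np.pi / 2, 0.0, 0.0)
child = forward_kinematics(parent, R, np.array([1.0, 0.0, 0.0]))
uv = project(child)
```

Projecting every pixel of a rotated body segment this way yields the image region of the morphing appearance model that is then scored against the template.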

Experiments with real sequences

To test the proposed contribution, we measure 3D human motion pose on the same subject with the image joints either manually marked or located by semi-automatic tracking. The first experiment shows how imprecise estimation of T(dz) affects the result of human motion pose reconstruction and discusses which factors influence its precise estimation. The second experiment tests the effectiveness of the proposed 3D joint point estimation from

Conclusion and future research

We proposed an algorithm to automatically reconstruct 3D human motion pose from uncalibrated monocular video sequences. A key feature of our approach is the method for reconstructing the 3D human pose from the corresponding 2D joints on the image plane. In the experiments, 3D pose reconstruction was accomplished either automatically or from manually marked joints.

There are several advantages of the proposed approach. First, no camera calibration is needed as required by previous approaches that use

Acknowledgements

This research was funded by National Science Foundation of China under Contracts 60673093, 90715043 and 60803024, and supported by Program for Changjiang Scholars and Innovative Research Team in University of China IRT0661.

About the author—BEIJI ZOU received the B.S. degree in computer science from Zhejiang University, China, in 1982, the M.S. degree from Tsinghua University, specializing in CAD and computer graphics, in 1984, and the Ph.D. degree from Hunan University in the field of control theory and control engineering in 2001. He is a professor in the School of Information Science and Engineering, Central South University, China. His research interests include computer graphics, CAD technology and image processing.

References (22)

  • L. Wang et al., Recent developments in human motion analysis, Pattern Recognition (2003)
  • Y. Song et al., Towards detection of human motion
  • R. Rosales et al., Estimating 3D body pose using uncalibrated cameras
  • N. Howe et al., Bayesian reconstruction of 3D human motion from single-camera video
  • K. Sminchisescu et al., Covariance scaled sampling for monocular 3D body tracking
  • K. Grauman et al., Inferring 3D structure with a statistical image-based shape model
  • A. Yilmaz et al., Object tracking: a survey, ACM Computing Surveys (2006)
  • K. Rohr, Incremental recognition of pedestrians from image sequences
  • C. Bregler et al., Tracking people with twists and exponential maps
  • D.M. Gavrila et al., 3D model-based tracking of humans in action: a multiview approach
  • C. Sminchisescu et al., Kinematic jump processes for monocular 3D human tracking


About the author—SHU CHEN is a Ph.D. candidate in computer application technology at Central South University, China. His research interests include computer vision, human motion tracking and recognition, and animation.
