Elsevier

Pattern Recognition

Volume 42, Issue 7, July 2009, Pages 1559-1571

Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking

https://doi.org/10.1016/j.patcog.2008.12.024

Abstract

We present a method to reconstruct human motion pose from uncalibrated monocular video sequences based on morphing appearance model matching. Human pose estimation is performed by integrating human joint tracking with pose reconstruction in depth-first order. First, the Euler angles of each joint are estimated by inverse kinematics under human skeleton constraints. Then, the scene coordinates of the pixels belonging to each body segment are determined by forward kinematics, and these pixels are projected onto the image plane under the assumption of perspective projection to obtain the region of the morphing appearance model in the image. Finally, the human motion pose is reconstructed by histogram matching. Experimental results show that the method obtains favorable reconstruction results on a number of complex human motion sequences.

Introduction

Human motion contains a wealth of information about the actions, intentions, emotions, and personality traits of a person, and it plays an important role in many application areas such as surveillance, human motion analysis, and virtual reality. Tracking human joints and reconstructing the corresponding 3D human motion posture from an uncalibrated monocular video sequence is an active research topic. Human motion pose reconstruction can be categorized into two groups: (1) methods using multi-view video sequences, and (2) methods using monocular video sequences. Reconstruction from monocular video sequences is more attractive because it is convenient to use, readily available to the general public, and imposes fewer restrictions. However, the depth of an object is lost when it is projected onto the 2D image plane, so 3D motion reconstruction from 2D image sequences remains a challenging task. Conventional methods for reconstructing human pose from monocular video sequences may require restrictive assumptions or prior knowledge. In contrast to these classical algorithms, in this paper we propose an approach to reconstruct 3D human motion pose from uncalibrated monocular video sequences by combining human joint tracking with pose extraction. Its advantages include fewer constraints, no need for the parameters of a camera model, ease of implementation, and more precise pose reconstruction.

Human pose reconstruction from monocular video sequences is roughly divided into two categories: machine learning methods and object tracking methods. Machine learning methods exploit prior knowledge to obtain more stable estimates of 3D human body pose [1], [2], [3], [4], [5]; however, they require a large number of training samples, which limits their applicability. Object tracking methods commonly follow two sequential steps: first, locating features of the human body and tracking them in each frame; second, reconstructing the human pose from the obtained features. Many researchers have studied the first step, and general surveys can be found in recent review papers [6], [7]. In this step, the configuration in the current frame and a dynamic model are typically used to predict the next configuration [8], [9]. Most approaches perform prediction with variants of Kalman filtering [9], [10] or particle filtering [11], [12], [13]. Particle filters restrict themselves to predictions returned by a motion model, which is hard to construct; such a scheme is susceptible to drift when an imprecise motion model yields poor predictions. Annealing the particle filter [14] or performing local searches [15] are ways to address this difficulty. The second step is human pose reconstruction, i.e., recovering the 3D coordinates of each feature from its 2D image coordinates. Some researchers reconstruct human motion pose from video sequences using constraints such as human skeleton proportions together with a camera model. These methods can be classified into two classes depending on the camera model adopted: (1) affine camera models; and (2) perspective camera models. The affine camera model is only an approximation of the real camera. The scaled-orthographic camera model is an important instance of this kind and is popularly used by many researchers [16], [17].
The scale factor s has a significant effect on the result of human motion pose reconstruction under the scaled-orthographic camera model [16]. In these methods, the scale factor s is estimated by satisfying a constraint formula rather than from a ground-truth value, so the reconstructed human pose can differ greatly from the real pose, and such methods can only handle images with very little perspective effect. In addition, relatively few research efforts address human pose reconstruction based on the perspective camera model [18], [19], [20]. Zhao et al. [20] restrict all body segments of the human figure to be almost parallel to the image plane in order to acquire accurate human skeleton proportions, and Peng [19] requires estimating a virtual scale factor for each frame.
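To make the contrast between the two camera models concrete, the following sketch compares a pinhole (perspective) projection against a scaled-orthographic one, where a single global scale factor s replaces per-point depth. The function names and numeric values are illustrative, not taken from the paper; the point is that perspective projection shrinks farther points while the scaled-orthographic approximation treats all depths alike, which is why it only suits scenes with little perspective effect.

```python
import numpy as np

def perspective_project(points, f=1.0):
    """Pinhole projection of 3D points (N, 3): x = f*X/Z, y = f*Y/Z."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]

def scaled_orthographic_project(points, s):
    """Scaled-orthographic approximation: depth is replaced by one
    global scale factor s, so x = s*X, y = s*Y for every point."""
    points = np.asarray(points, dtype=float)
    return s * points[:, :2]

# Two joints at the same (X, Y) but different depths.
joints = np.array([[0.5, 0.5, 2.0],
                   [0.5, 0.5, 4.0]])
persp = perspective_project(joints, f=1.0)
ortho = scaled_orthographic_project(joints, s=1.0 / 3.0)  # s ~ f / mean depth
```

Under perspective projection the two joints land at different image positions; under the scaled-orthographic model they coincide, so any depth variation within the body is lost.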

The remainder of this paper describes our algorithm in more detail. Section 2 explains the data flow in our system, and Section 3 describes its initialization. The detailed procedure for reconstructing human motion pose is given in Section 4, and Section 5 presents results from our system. Finally, we conclude the study and point out directions for future work.

Section snippets

Overview

The basic idea of our algorithm is to reconstruct the 3D human pose from the corresponding 2D joints on the image plane. The position of each human joint in every frame of the video is located by a local search using morphing appearance model matching.
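The matching score used in such a local search can be based on histogram comparison, as the abstract describes. A minimal sketch, using normalized histogram intersection as one plausible similarity measure (the paper's exact metric is not specified here):

```python
import numpy as np

def hist_similarity(patch_a, patch_b, bins=16):
    """Score two grayscale patches by normalized histogram intersection.
    Returns a value in [0, 1]; 1.0 means identical intensity distributions."""
    ha, _ = np.histogram(patch_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(patch_b, bins=bins, range=(0, 256))
    ha = ha / max(ha.sum(), 1)
    hb = hb / max(hb.sum(), 1)
    return np.minimum(ha, hb).sum()

rng = np.random.default_rng(0)
template = rng.integers(0, 256, size=(20, 20))
same = hist_similarity(template, template)                      # identical patches
other = hist_similarity(template, rng.integers(0, 256, size=(20, 20)))
```

In a local search, this score would be evaluated at each candidate joint position and the maximizer taken as the tracked location.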

The proposed algorithm is divided into four major steps as shown in Fig. 1. The first step is to initialize models by a simple user interface with the first frame as input, the texture information and space

Human skeleton model

We represent the human body as a tree stick model, inspired by the human body model employed at the Human Modeling and Simulation Center at the University of Pennsylvania [21]. As shown in Fig. 3, the human skeleton model consists of rigid parts connected by joints, in which J1 is the root joint, corresponding to the pelvis. Information about the other joints is provided in Table 1. Fig. 4 shows the tree structure of the human skeleton model. The relative lengths of human body segments in the model are
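A tree-structured stick skeleton like the one described can be sketched as joints that each store an offset from their parent, with absolute positions accumulated up to the root. The joint names and offsets below are illustrative placeholders, not the paper's actual Table 1 values:

```python
class Joint:
    """One node of a tree stick skeleton rooted at the pelvis."""
    def __init__(self, name, offset, parent=None):
        self.name = name
        self.offset = offset          # 3D offset from the parent joint
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def position(self):
        """Absolute position: accumulate offsets up the chain to the root."""
        if self.parent is None:
            return list(self.offset)
        p = self.parent.position()
        return [p[i] + self.offset[i] for i in range(3)]

# A tiny fragment of the tree rooted at the pelvis (J1 in the paper).
pelvis = Joint("J1_pelvis", [0.0, 0.0, 0.0])
hip_r  = Joint("hip_right", [0.1, 0.0, 0.0], parent=pelvis)
knee_r = Joint("knee_right", [0.0, -0.45, 0.0], parent=hip_r)
```

Traversing such a tree in depth-first order, as the paper's pipeline does, visits each limb chain from the root outward.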

Initialization

The problem can be decomposed into three sub-problems: estimating the relative lengths of body segments in the human skeleton model, initializing the appearance model of the person, and estimating the scale factor s corresponding to the root joint.

We have developed a graphical user interface that allows the user to select the projections of the subject's joints in the first frame. A marked image is shown in Fig. 9, in which the green dots depict all selected joints while the

Human motion tracking by template matching

Human motion tracking is performed by a local search in the image using template matching. The whole tracking is decomposed into three steps: (1) estimating the rotation Euler angles from the coordinates of the joint candidate, (2) estimating the scene coordinates of the appearance-model pixels after rotation by forward kinematics based on the estimated Euler angles, and (3) estimating the region of the morphing appearance model by projecting these pixels onto the image plane, then
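Steps (2) and (3) above can be sketched as follows: rotate a limb's local offset by a rotation built from Euler angles, place the child joint by forward kinematics, and project the resulting scene point onto the image plane. The Z-Y-X angle convention and the numbers here are assumptions for illustration, not necessarily the paper's:

```python
import numpy as np

def euler_zyx(rz, ry, rx):
    """Rotation matrix from Z-Y-X Euler angles (radians)."""
    cz, sz = np.cos(rz), np.sin(rz)
    cy, sy = np.cos(ry), np.sin(ry)
    cx, sx = np.cos(rx), np.sin(rx)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def forward_kinematics(parent_pos, R, local_offset):
    """Place a child joint: rotate its local offset, add the parent position."""
    return parent_pos + R @ local_offset

def project(point3d, f=1.0):
    """Perspective projection of a scene point onto the image plane."""
    return f * point3d[:2] / point3d[2]

# Example: a unit-length limb rotated 90 degrees about Z, then projected.
parent = np.array([0.0, 0.0, 3.0])
R = euler_zyx(np.pi / 2, 0.0, 0.0)
child = forward_kinematics(parent, R, np.array([1.0, 0.0, 0.0]))
uv = project(child)
```

Projecting every pixel of a rotated body segment this way yields the image region of the morphing appearance model that is then scored against the template.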

Experiments with real sequences

To test the proposed contribution, we measure 3D human motion pose on the same subject with the image joints either manually marked or located by semi-automatic tracking. The first experiment shows how imprecise estimation of T(dz) affects the result of human motion pose reconstruction and discusses which factors influence its precise estimation. The second experiment tests the effectiveness of the proposed 3D joint point estimation from

Conclusion and future research

We proposed an algorithm to automatically reconstruct 3D human motion pose from uncalibrated monocular video sequences. A key feature of our approach is the method for reconstructing the 3D human pose from the corresponding 2D joints on the image plane. In the experiments, 3D pose reconstruction was accomplished either automatically or from manually marked joints.

There are several advantages of the proposed approach. First, no camera calibration is needed as required by previous approaches that use

Acknowledgements

This research was funded by National Science Foundation of China under Contracts 60673093, 90715043 and 60803024, and supported by Program for Changjiang Scholars and Innovative Research Team in University of China IRT0661.

About the author—BEIJI ZOU received the B.S. degree in computer science from Zhejiang University, China, in 1982, the M.S. degree from Tsinghua University, specializing in CAD and computer graphics, in 1984, and the Ph.D. degree from Hunan University in the field of control theory and control engineering in 2001. He is a professor in the School of Information Science and Engineering, Central South University, China. His research interests include computer graphics, CAD technology and image processing.

References (22)

  • L. Wang et al., Recent developments in human motion analysis, Pattern Recognition (2003)
  • Y. Song et al., Towards detection of human motion
  • R. Rosales et al., Estimating 3D body pose using uncalibrated cameras
  • N. Howe et al., Bayesian reconstruction of 3D human motion from single-camera video
  • K. Sminchisescu et al., Covariance scaled sampling for monocular 3D body tracking
  • K. Grauman et al., Inferring 3D structure with a statistical image-based shape model
  • A. Yilmaz et al., Object tracking: a survey, ACM Computing Surveys (2006)
  • K. Rohr, Incremental recognition of pedestrians from image sequences
  • C. Bregler et al., Tracking people with twists and exponential maps
  • D.M. Gavrila et al., 3D model-based tracking of humans in action: a multiview approach
  • C. Sminchisescu et al., Kinematic jump processes for monocular 3D human tracking


About the author—SHU CHEN is a Ph.D. candidate in computer application technology at Central South University, China. His research interests include computer vision, human motion tracking and recognition, and animation.
