Retrieval and constraint-based human posture reconstruction from a single image

https://doi.org/10.1016/j.jvcir.2005.01.002

Abstract

In this study, we present a novel model-based approach to reconstruct the 3D human posture from a single image. The approach is guided by a posture library and a set of constraints. Given a 2D human figure, i.e., a set of labeled body segments and an estimated root orientation in the image, a 3D pivotal posture whose 2D projection is similar to the human figure is first retrieved from the posture library. To facilitate the retrieval process, a table-lookup technique is proposed to index postures according to their 2D projections with respect to designated view directions. Next, physical and environmental constraints, including segment length ratios, joint angle limits, pivotal posture reference, and feet-floor contact, are automatically applied to reconstruct the 3D posture. Experimental results show the effectiveness of the proposed approach.

Introduction

We seek to reconstruct 3D postures of a human actor from given 2D images. This problem has drawn great attention due to its wide variety of applications, such as motion capture [1], [2], user interfaces [3], [4], and character animation [5], [6]. In these applications, the source image data can be a single image or single/multi-view video. In this paper, we confine ourselves to the single-image case, which is also required for initialization in the video case.

Suppose that a 2D human figure, i.e., a set of labeled body segments and an estimated root orientation in the image, is given by a user. To reconstruct the 3D posture of the 2D human figure, the main challenge is to determine the depth information of the human figure elements. That is, since an image does not record 3D depth, each foreshortened body segment can point either towards or away from the viewer with respect to the image plane. Consequently, the number of possible postures grows exponentially with the number of body segments: if there are n body segments in the human figure, the number of possible 3D postures consistent with the given image is 2ⁿ in general. To solve this depth ambiguity problem, several methods have been proposed. They fall into two main approaches, namely model-based and learning-based. A brief review of the two approaches is given below, covering both posture reconstruction from a single image and motion recovery from single-view video.
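To make the combinatorial ambiguity concrete, the following minimal sketch (ours, not the paper's; we use Python/NumPy, whereas the paper's implementation is in Matlab) enumerates the depth-sign interpretations of a stick figure under scaled orthographic projection:

```python
# Illustrative sketch of the 2^n depth ambiguity (assumed scaled
# orthographic projection; not code from the paper).
from itertools import product
import numpy as np

def depth_interpretations(lengths_3d, lengths_2d, scale):
    """For each segment of 3D length L whose 2D projection has length l,
    the out-of-plane depth offset is dz = +/- sqrt(L^2 - (l/scale)^2).
    Every sign choice is geometrically valid, so n segments yield up to
    2^n candidate postures."""
    dz = [np.sqrt(max(L * L - (l / scale) ** 2, 0.0))
          for L, l in zip(lengths_3d, lengths_2d)]
    return [tuple(s * d for s, d in zip(signs, dz))
            for signs in product((+1, -1), repeat=len(dz))]

# Three segments already give 2^3 = 8 candidate interpretations.
print(len(depth_interpretations([1.0, 0.8, 0.5], [0.9, 0.5, 0.3], 1.0)))
```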

The model-based approach uses an articulated human model to generate possible 3D postures that match the 2D human figure. To obtain the best 3D solution, a set of physical, environmental, or dynamic constraints is then applied to cull invalid 3D postures generated initially. Lee and Chen [7] first extract the camera extrinsic parameters through geometric calibration and then generate a set of 3D postures for the given 2D human figure image. These 3D postures are verified using joint angle limits, body segment lengths, collision detection, and heuristic motion rules to prune infeasible ones. Bregler and Malik [8] introduce twists and the product of exponential maps to model the kinematic relationships of an articulated human model. Based on this model, the 3D posture of the first video frame is acquired by minimizing the difference between the projected 3D posture and the given 2D human figure. Difranco et al. [9] propose a Scaled Prismatic Model (SPM) [10] to track 2D joint positions. They formulate a batch optimization function that involves a series of SPM measurements and constraints, including kinematic constraints, joint angle limits, dynamic smoothing, and 3D key frames. The optimization function is solved iteratively to recover 3D articulated motion. Taylor [11] shows that solutions to the 3D posture reconstruction problem can be parameterized by a scale factor under scaled orthographic projection, and further deduces a lower bound on the scale factor. Parameswaran and Chellappa [12] extend Taylor's work to the perspective projection model, and Loy et al. [13] apply Taylor's method to reconstruct long action sequences. Barron and Kakadiaris [14] estimate the anthropometry and 3D posture simultaneously for the given 2D human figure by minimizing a cost function subject to joint angle limits and segment length ratios. Park et al. [15] exploit 3D motion data given by users to recover motion from video; these motion data are expected to provide a good initial guess in the objective function for estimating joint orientations and the root trajectory.
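As a hedged illustration of Taylor's parameterization [11] (the structure and names below are ours), the per-segment depth difference under scaled orthographic projection can be computed as:

```python
# Sketch of Taylor-style depth recovery under scaled orthographic
# projection; the sign of dZ remains ambiguous and must be resolved
# by the user or by constraints.
import numpy as np

def relative_depth(p1, p2, L, s):
    """p1, p2: 2D image endpoints of a segment of known 3D length L;
    s: the scale factor. Returns |dZ| = sqrt(L^2 - (du^2 + dv^2) / s^2)."""
    du, dv = p1[0] - p2[0], p1[1] - p2[1]
    foreshortened_sq = (du * du + dv * dv) / (s * s)
    # Taylor's lower bound: a projection cannot be longer than the segment,
    # so s >= sqrt(du^2 + dv^2) / L must hold for every segment.
    assert foreshortened_sq <= L * L, "s violates the lower bound"
    return np.sqrt(L * L - foreshortened_sq)
```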

Since reconstruction based solely on a single image is in general insufficient to resolve the depth ambiguity thoroughly, extra information is needed to obtain the desired 3D posture. Therefore, either particular motion types such as unidirectional walking are assumed to reduce the reconstruction complexity [7], [8], [16], [17], [18], or extra visual cues about the 2D human figure are provided by users. For example, in Difranco's method [9], users are asked to set several keyframes of the video sequence and guess the initial 3D coordinates of body joints with respect to these keyframes. In Taylor's method [11], users have to specify, for each body segment, the joint that is nearer to the viewer. In Barron's method [14], users must locate the segments that are nearly parallel to the image plane for anthropometry estimation. In Park's method [15], users first prepare appropriate 3D motion data for the given video clip and then mark corresponding keyframes between the video clip and the motion data for motion synchronization. All these methods require complicated human perception and interaction to provide the extra visual cues. Some studies [19], [20] propose fully automatic methods to locate body segments in an image; however, their accuracy is still far from users' expectations.

Learning-based approaches try to derive mapping functions between features in the 2D image and those in the 3D posture through stochastic learning processes. They require a large set of training data to learn prior knowledge of specific postures and motions. Pavlović et al. [21] describe a switching linear dynamic system (SLDS) to learn figure dynamics of fronto-parallel motion from video. A novel Viterbi approximation algorithm for inference in the SLDS is derived to overcome the exponential complexity of motion classification, tracking, and synthesis. Brand [22] and Elgammal and Lee [23] use dynamic manifolds to model high-dimensional human motion. Given a 2D silhouette in a video sequence, 3D motion and orientation are inferred through the dynamic manifolds. Howe et al. [24] divide motion data into short motion elements called snippets that are used to build a probability density function. To reconstruct 3D motion, they divide the 2D tracking data into snippets and then find the best 3D snippet for each 2D observation using maximum-a-posteriori estimation. Tomasi and Kanade [34] propose a factorization technique that decomposes rigid shapes in image sequences to generate basis shapes; given 2D tracking data, these basis shapes can then be used to recover the corresponding 3D information. Bregler et al. [35] extend Tomasi's work to non-rigid shapes. Rosales and Sclaroff [25] design the Specialized Mappings Architecture (SMA), which maps 2D image features onto 3D body posture parameters; the mapping functions in SMA are learned through the EM algorithm. Agarwal and Triggs [26] apply the Relevance Vector Machine (RVM) to learn 2D–3D mapping functions, regressing 55D vectors of 3D body joint angles from 100D vectors of the human image silhouette. Grochow et al. [36] present the Scaled Gaussian Process Latent Variable Model (SGPLVM) to learn the probability density function of motion capture postures; the SGPLVM can be learned automatically from a small training data set and works well in real-time animation applications. These methods, however, only search databases to find the postures most similar to the given 2D images; no extra mechanism is provided to tune the found postures. Moreover, learning-based approaches spend considerable time learning 2D–3D mapping functions from large amounts of training data, and whenever the training data are modified, the mapping functions must be recomputed.

To conclude, 3D posture reconstruction from a single image is ill-posed due to insufficient spatial information, and domain constraints or knowledge can moderate the underconstrained depth ambiguity problem. Both model-based and learning-based approaches have their own merits and can provide feasible solutions under particular considerations. By combining the guiding data set of the learning-based approach with the a priori knowledge of the human model and constraints of the model-based approach, we propose a novel algorithm for the reconstruction problem.

In this paper, we present a novel approach to reconstruct the human posture from a single image. To overcome the depth ambiguity problem, we exploit a posture library and constraints to guide the reconstruction. Suppose that a 2D human figure, i.e., a set of labeled body segments and an estimated root orientation in the image, is given. The proposed approach first retrieves from the library an appropriate candidate whose 2D projection is similar to the human figure in the image. Since the candidate is retrieved from a large posture library, the effectiveness of the approach highly depends on the efficiency of the retrieval process. Therefore, we propose a table-lookup technique to index the 3D human postures in the library: each library posture is projected onto several sampled view directions, and the corresponding projection features are extracted and stored in the corresponding array elements for future retrieval. Next, physical and environmental constraints, including segment length ratios, joint angle limits, pivotal posture reference, and feet–floor contact, are automatically applied step by step to reconstruct the 3D human posture for the given 2D human figure. Fig. 1 shows the reconstruction procedure of the proposed approach, where the word “ERO” beneath the image abbreviates “Estimated Root Orientation.”
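The following sketch illustrates how such a posture table might be populated. The 12 azimuth bins over α ∈ [0, 2π) follow the setting reported in Section 4; treating the second table dimension (8 bins) as an elevation angle and leaving the feature extractor abstract are our own assumptions:

```python
# Sketch of posture-table creation (our reading of the scheme; the paper's
# exact feature extractor and second table dimension are not shown here).
import numpy as np

N_ALPHA, N_BETA = 12, 8  # 12 x 8 posture table, as in Section 4

def view_bins(alpha, beta):
    """Quantize a view direction (alpha, beta) to posture-table indices."""
    i = int(alpha / (2 * np.pi / N_ALPHA)) % N_ALPHA   # alpha in [0, 2*pi)
    j = min(int(beta / (np.pi / N_BETA)), N_BETA - 1)  # assumed beta in [0, pi]
    return i, j

def build_posture_table(library, project_features, sample_views):
    """Project every library posture onto every sampled view direction and
    store the resulting 2D features in the cell of that direction."""
    table = [[[] for _ in range(N_BETA)] for _ in range(N_ALPHA)]
    for pid, posture in enumerate(library):
        for alpha, beta in sample_views:
            i, j = view_bins(alpha, beta)
            table[i][j].append((pid, project_features(posture, alpha, beta)))
    return table
```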

Our approach effectively integrates the techniques of the model-based approach with the guiding data set of postures used in the learning-based approach. Compared with the requirement of providing extra visual cues in existing model-based methods, our approach only asks users to label body segments on the image (the same requirement as in existing model-based methods), with no further complicated indications required; the posture library is exploited to deal with the depth ambiguity problem automatically. Compared with learning-based approaches, our approach can further refine the retrieved posture automatically according to the given constraints rather than merely outputting the retrieved posture. Besides, the proposed table-lookup index mechanism speeds up the retrieval process and requires no time for data training.

Note that the posture library is assumed to contain data similar to the posture implied by the given image. This assumption is reasonable for most corpus-based applications. In other words, we assume that users have an appropriate posture library that records the same motion type as implied by the given image. For example, to reconstruct postures of Tai Chi Chuan from images, users would employ a posture library containing Tai Chi Chuan posture data.

This paper is organized as follows. Section 2 presents preprocessing for the posture library, including posture feature representation and posture table creation. Section 3 describes the posture reconstruction process, including pivotal posture retrieval and constraint-based reconstruction. Section 4 shows our experimental results. Section 5 gives some conclusions and future work.


Posture library preprocessing

The objective of this section is to build an index structure for effectively retrieving pivotal postures from the posture library. It consists of two parts, namely, posture feature representation and posture table creation. In the posture feature representation part, we introduce the definitions and notations of 3D human postures in the posture library. In the posture table creation part, we propose a table-lookup technique to index 3D human postures. The index structure of the lookup table is easy
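The paper's exact feature representation is not included in this excerpt; as a placeholder consistent with the scaled orthographic setting, the sketch below projects 3D joint positions from a view direction (α, β) and uses root-normalized 2D joint coordinates as the feature vector:

```python
# Hypothetical projection-feature extractor (our assumption, for
# illustration only); joint 0 is taken to be the root.
import numpy as np

def projection_features(joints_3d, alpha, beta, s=1.0):
    """joints_3d: (n, 3) array of joint positions in the body frame.
    Rotate by alpha about the vertical axis and beta about the horizontal
    axis, then drop depth (scaled orthographic projection)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])
    rotated = joints_3d @ (Rx @ Rz).T
    uv = s * rotated[:, :2]   # keep image-plane coordinates only
    uv = uv - uv[0]           # normalize to the root joint
    return uv.ravel()
```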

Human posture reconstruction

Suppose that an image with a postured character is given for 3D human posture reconstruction. In our approach, users are first asked to provide a 2D human figure by labeling body segments and estimating the root orientation of the postured character in the image. Then the reconstruction work is accomplished through the following two processes: pivotal posture retrieval and constraint-based reconstruction. In the pivotal posture retrieval process, a 3D pivotal posture whose 2D projection is the
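Continuing the sketches above (with `view_bins` from the indexing sketch), the retrieval and refinement stages might look as follows. The distance measure, the `Posture` container, and the clamping and translation steps are our placeholders, and only two of the four constraints (joint angle limits and feet-floor contact) are sketched:

```python
# Hedged sketch of the two reconstruction stages; not the paper's code.
from dataclasses import dataclass
import numpy as np

@dataclass
class Posture:                 # hypothetical container
    angles: dict               # joint name -> angle (radians)
    root: np.ndarray           # root position; index 1 is height
    feet_y: tuple              # heights of the two feet

def retrieve_pivotal_posture(table, library, figure_features, alpha, beta):
    """Look up the cell of the estimated root orientation and return the
    library posture whose stored 2D features best match the figure."""
    i, j = view_bins(alpha, beta)
    pid, _ = min(table[i][j],
                 key=lambda c: float(np.linalg.norm(c[1] - figure_features)))
    return library[pid]

def apply_constraints(posture, joint_limits, floor_y=0.0):
    """Clamp each joint angle to its anatomical range (joint angle limits)
    and translate the body so the lower foot rests on the floor
    (feet-floor contact)."""
    for joint, (lo, hi) in joint_limits.items():
        posture.angles[joint] = float(np.clip(posture.angles[joint], lo, hi))
    posture.root[1] += floor_y - min(posture.feet_y)
    return posture
```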

Experimental results

We use motion capture data of Cheng’s Tai Chi Chuan [31], a traditional Chinese martial art, as our posture library. The library contains more than 20,000 3D human postures captured from a professional martial art master. The proposed approach is implemented in Matlab on an Intel Pentium 4 2.4 GHz computer with 512 MB memory. The posture table used in this study is a 12 × 8 array; in other words, the range of the angle α ∈ [0, 2π) is equally divided into 12 bins. The search range on the posture

Discussion

We remark that the error comes from the following factors:

  1. Scaled orthographic projection. Taylor [11] designed a simulation experiment to investigate the effect of scaled orthographic projection compared to perspective projection. According to Taylor’s simulation result, scaled orthographic projection alone contributes at least 5.88% RMSE in our experimental case. Compared with our experimental results in Table 2, we speculate that about 1–2% of the error is caused by other

Conclusions and future work

In this study, we present a novel model-based approach to reconstruct the 3D human posture from a single image. The approach is guided by posture library retrieval and constraint-based reconstruction. A table-lookup index structure is devised to facilitate the retrieval, and physical and environmental constraints are automatically applied to reconstruct the 3D human posture. The major contribution is that we use the posture library to avoid the need for manually provided extra visual cues.

References (36)

  • J. Davis, M. Agrawala, E. Chuang, Z. Popović, D. Salesin, A sketching interface for articulated figure animation, in:...
  • C. Bregler, J. Malik, Tracking people with twists and exponential maps, in: IEEE Conference on Computer Vision and...
  • D.E. Difranco, T.J. Cham, J.M. Rehg, Recovery of 3D articulated motion from 2D correspondences, Compaq Cambridge...
  • D.D. Morris, J.M. Rehg, Singularity analysis for articulated object tracking, in: IEEE Conference on Computer Vision...
  • V. Parameswaran, R. Chellappa, View independent human body pose estimation from a single perspective image, in: IEEE...
  • G. Loy, M. Eriksson, J. Sullivan, S. Carlsson, Monocular 3D reconstruction of human motion in long action sequences,...
  • C. Barron, I.A. Kakadiaris, Estimating anthropometry and posture from a single image, in: IEEE Conference on Computer...
  • M.J. Park, M.G. Choi, S.Y. Shin, Human motion reconstruction from inter-frame feature correspondences of a single video...

This study was supported partially by the MOE Program for Promoting Academic Excellence of Universities under Grant No. 89-E-FA04-1-4 and by the National Science Council, Taiwan, under Grant NSC92-2213-E-007-081.
