Model-based 3D hand posture estimation from a single 2D image

https://doi.org/10.1016/S0262-8856(01)00094-4

Abstract

Passive sensing of the 3D geometric posture of the human hand has been studied extensively over the past decade. However, these research efforts have been hampered by the computational complexity of inverse kinematics and 3D reconstruction. In this paper, we focus on 3D hand posture estimation from a single 2D image. We introduce a human hand model with 27 degrees of freedom (DOFs) and analyze some of its constraints to reduce the model from 27 to 12 DOFs without any significant degradation of performance. A novel algorithm to estimate the 3D hand posture from eight 2D projected feature points is proposed. Experimental results using real images confirm that our algorithm gives good estimates of the 3D hand pose.

Introduction

Hand posture analysis is an active research field that has received much attention in recent years. It is central to applications such as gesture recognition [1], [2], [3], [4], human–computer interaction (HCI) [5], sign language recognition (SLR) [6], [7], [8], virtual reality (VR), computer graphics animation (CGA) and medical studies.

General solutions for posture analysis fall into two categories. One uses mechanical devices, such as glove-based devices, to directly measure hand joint angles and spatial positions. The other uses computer vision-based techniques. Although the former gives real-time processing and reliable information, it requires the user to wear a cumbersome device, generally tethered to a computer by a load of cables, all of which makes sensing natural hand motion difficult. The latter, by contrast, is well suited to hand posture estimation since vision is a non-invasive way of sensing.

Vision-based approaches can be classified into two types: appearance-based and three-dimensional (3D) model-based approaches. Appearance-based methods are mainly based on the visual image model and use image templates to describe postures: gestures are modeled by relating the appearance of any gesture to the appearance of a set of predefined template gestures. Starner et al. [6] use silhouette moments as features to analyze American Sign Language (ASL). In their research project, “Real-time American Sign Language Recognition Using Desktop and Wearable Computer Based Video”, they present two real-time hidden Markov model-based systems for recognizing sentence-level continuous ASL using a single camera to track the user's unadorned hands.
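As a hedged illustration of the appearance-based features mentioned above, the sketch below computes simple image moments (area, centroid, and second-order central moments) of a binary silhouette with NumPy. The array layout and function names are our own, not those of Starner et al.

```python
import numpy as np

def silhouette_moments(mask):
    """Compute simple moments of a binary silhouette.

    Returns the area m00, the centroid (cx, cy), and the
    second-order central moments often used as shape features.
    """
    ys, xs = np.nonzero(mask)
    m00 = xs.size                          # area: number of silhouette pixels
    cx, cy = xs.mean(), ys.mean()          # centroid
    mu20 = ((xs - cx) ** 2).mean()         # spread along x
    mu02 = ((ys - cy) ** 2).mean()         # spread along y
    mu11 = ((xs - cx) * (ys - cy)).mean()  # cross term, related to orientation
    return m00, (cx, cy), (mu20, mu11, mu02)

# Toy silhouette: a filled 4x4 square inside a 10x10 image.
mask = np.zeros((10, 10), dtype=bool)
mask[3:7, 2:6] = True
m00, (cx, cy), central = silhouette_moments(mask)
```

Such moments are cheap to compute from a single binary image, which is precisely the appeal of appearance-based features.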

The major advantage of appearance-based methods is the simplicity of their parameter computation. However, they are sensitive to viewpoint changes and cannot provide precise spatial information, which makes them less suitable for manipulative and interactive applications.

Conventional model-based methods are mainly used in two areas: 3D hand tracking and 3D hand posture estimation. Hand tracking locally tracks and estimates the positions of the joints and tips of the hand across an image sequence. By analyzing static and dynamic motions of the human hand, Lee and Kunii [9] present constraints on the joints and use them to simulate the human hand in real images. In their experiments, they used markers to identify the fingertips. Building on Lee's contribution, Lien et al. [10] proposed a fast hand-model fitting method for tracking hand motion. Although they improve the performance of the tracking algorithm, the computation of inverse kinematics is still required.

Rehg [11] described DigitEyes, a real-time hand tracking system in which the articulated motion of the fingers served as a 3D mouse, using a hand model with 27 degrees of freedom (DOFs). This approach was based on the assumption that the positions of the fingertips relative to the palm are almost always sufficient to differentiate a finite number of gestures. The hand gesture was estimated by a non-linear least squares method that minimizes the residual distances between the finger links and tips of the model hand and those of the observed hand.
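The residual-minimization idea can be sketched for a toy one-link "finger": the model tip position is a function of a single joint angle, and Gauss–Newton iteration recovers the angle that minimizes the tip-position residual. The link length and parameterization here are illustrative assumptions, not Rehg's actual formulation.

```python
import numpy as np

L = 5.0  # assumed length of the single finger link

def tip(theta):
    """Model fingertip position for joint angle theta (radians)."""
    return np.array([L * np.cos(theta), L * np.sin(theta)])

def fit_angle(observed, theta0=0.1, iters=50):
    """Gauss-Newton on the residual r(theta) = tip(theta) - observed."""
    theta = theta0
    for _ in range(iters):
        r = tip(theta) - observed
        # Jacobian of tip(theta) with respect to theta.
        J = np.array([-L * np.sin(theta), L * np.cos(theta)])
        theta -= (J @ r) / (J @ J)  # Gauss-Newton update
    return theta

theta_true = 0.8
theta_hat = fit_angle(tip(theta_true))
```

A real hand model stacks many such residuals (one per finger link and tip) over a much larger parameter vector, which is what makes the full non-linear least squares problem expensive.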

Shimada et al. [4] present a method to track the pose (joint angles) of a moving hand and refine the 3D shape (widths and lengths) of the given hand model from a monocular image sequence. First, the algorithm uses the silhouette features and motion prediction to obtain the approximated 3D shape. Then, with inequality constraints, they refine the estimation by the extended Kalman filter (EKF).

Without the motion information, some research efforts have concentrated on 3D hand posture estimation. In the study of Chang [12], a prototype system for estimating the position and orientation of a human hand as well as the joint angles of the thumb and the fingers from a single image is developed. The hand pose is estimated by using sparse range data generated by laser beams and by using the generalized Hough transform. Possible configurations for the fingers and the thumb are generated by the inverse kinematic technique.

Although the above algorithms show promising results, posture estimation is not yet advanced enough to provide flexible and reliable performance for potential applications. Estimating kinematic parameters from detected features is a complex and cumbersome task, and faces the following problems. First, the articulated mechanism of the human hand, with its many DOFs, is more difficult to analyze than a single rigid object: its state space is larger and its appearance is more complicated. Second, model-based methods always involve solving the inverse kinematics, which is in general ill posed and computationally expensive. Third, previous 3D methods require multiple cameras, which is not only resource-consuming but also needs some form of 3D reconstruction that is itself computationally intense. Finally, it should be pointed out that knowledge of exact hand posture parameters seems unnecessary for the recognition of communicative gestures.

In this paper, our goal is to avoid the complex computation of inverse kinematics and 3D reconstruction; that is, without using 3D information, we propose a new approach to estimate the 3D hand posture from a single two-dimensional (2D) image. Preliminary results can be found in [13], [14], which deal only with finger posture. This paper extends the idea further to compute the 3D posture of the entire hand. First, we analyze the human hand model with 27 DOFs and its constraints. The constraints play an important role in our study, helping us to reduce the model from 27 to 12 DOFs without significant degradation of performance. Using the hand model and its constraints, we develop an algorithm that estimates the 3D hand posture from eight feature points: the wrist, the tips of the fingers and thumb, and the metacarpophalangeal joints of the middle finger and thumb. We use color markers to identify these eight points and retrieve the approximate posture of the hand. Occlusion of any of the eight points is not considered in this paper.

In the experiments, two feature extraction methods are utilized: one for model building and the other for on-line hand posture estimation. Extracting the parameters of the hand model requires a higher degree of accuracy in detecting the feature points; in this case, the feature points are extracted from the silhouette contour of the outstretched hand. For on-line hand posture estimation, the silhouette contour may not contain essential feature points (say, the fingertips of a clenched fist), so color markers placed at the necessary positions on the hand are utilized. Pose estimation results obtained from real images are shown and compared. These results confirm that our algorithm gives correct hand posture estimation.

This paper is organized as follows: Section 2 discusses the hand model and its constraints. Section 3 presents the methodology to estimate the hand posture. Two test cases involving various degrees of finger-extension are investigated in Section 4.

Section snippets

DOFs hand model

Lee and Kunii [9] defined a hand model with 27 DOFs. The joints of the human hand are classified into three kinds: flexion, directive and spherical joints, which have one DOF (extension/flexion), two DOFs (one for extension/flexion and one for adduction/abduction) and three DOFs (rotation), respectively (see Fig. 1). Each finger has four DOFs, described by θ1–θ4. The thumb has five DOFs, described by θ1–θ5. Including the six DOFs for the translation and rotation of the wrist, the model has 27 DOFs in total.
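The DOF count of the model can be tallied explicitly:

```latex
\underbrace{4 \times 4}_{\text{four fingers}}
\;+\; \underbrace{5}_{\text{thumb}}
\;+\; \underbrace{6}_{\text{wrist translation and rotation}}
\;=\; 27 \ \text{DOFs}
```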

Problem description

The purpose of this part is to analyze the geometric characteristics of the hand and replicate its pose from 2D projected information. Without loss of generality, we assume that the world coordinate frame is aligned with the camera frame; in other words, the image plane coincides with the XY plane of the world. In our experiments, we adopt color markers to identify the feature points of the joints and tips.
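Under the stated alignment of world and camera frames, an orthographic projection simply drops the Z coordinate of each 3D point. The sketch below is a minimal illustration under that assumption (the function name and data are ours, not the paper's full imaging model); it also shows why a single image leaves depth ambiguous.

```python
import numpy as np

def project_orthographic(points_3d):
    """Project Nx3 world points onto the image plane (the world XY plane).

    With the world frame aligned to the camera frame, orthographic
    projection keeps (X, Y) and discards the depth Z.
    """
    return points_3d[:, :2]

# Two marker positions (X, Y, Z) differing only in depth:
# they project to the same image point.
markers = np.array([[1.0, 2.0, 5.0],
                    [1.0, 2.0, 9.0]])
uv = project_orthographic(markers)
```

The lost depth is exactly what the hand-model constraints must recover when estimating 3D posture from a single view.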

Solution for five points in the 2D ‘finger plane’

For a particular finger, we define the 3D distance between the tip of finger T to

Feature detection based on hand contour in initial frame

The feature detection/extraction stage is concerned with detecting the features used to estimate the parameters of the chosen hand model. Its accuracy affects the model parameters and hence the estimation results. For accuracy, we use the following steps to detect features from the hand contour:

1. From the original image (see Fig. 5), extract the contour of the open hand using the LOG operator [18].

2. Calculate the contour curvature [19].

3. Obtain the local maximum and
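Step 2 above, the contour curvature, can be sketched as follows. This is a generic finite-difference implementation of the standard plane-curve curvature formula, not necessarily the method of [19]; it is checked on a circle, whose curvature is 1/r everywhere.

```python
import numpy as np

def contour_curvature(x, y):
    """Signed curvature of a sampled contour.

    Implements k = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2),
    with derivatives approximated by centered differences
    via np.gradient (one-sided at the endpoints).
    """
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5

# Sanity check on a circle of radius r: curvature should be ~1/r.
r = 10.0
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
k = contour_curvature(r * np.cos(t), r * np.sin(t))
```

On a hand contour, fingertips and the valleys between fingers show up as local curvature extrema, which is what step 3 exploits.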

Conclusion

In this paper, we proposed an algorithm to estimate the 3D hand posture from a single 2D image. The new algorithm is promising for the following reasons: First, the algorithm uses the 2D positions of the feature points and avoids the computational complexity caused by 3D reconstruction. Second, the algorithm does not involve the computation of inverse kinematics. Third, the algorithm uses only a single 2D image to retrieve the 3D hand posture. There is no need to know the motion information.

References (20)

  • S. Tamura et al., Recognition of sign language motion images, Pattern Recognition (1988)
  • C.C. Lien et al., Model-based articulated hand motion tracking for gesture recognition, Image and Vision Computing (1998)
  • J. Triesch, C. von der Malsburg, A gesture interface for human–robot-interaction, The Proceedings of FG'98, Nara,...
  • A.F. Bobick et al., A state-based approach to the representation and recognition of gesture, IEEE Transactions on Pattern Analysis and Machine Intelligence (1997)
  • T.J. Darrel, A. Pentland, Attention-driven expression and gesture analysis in an interactive environment, Proceedings...
  • N. Shimada, Y. Shirai, J. Miura, Hand gesture estimation and model refinement using monocular camera—ambiguity...
  • V.I. Pavlovic et al., Visual interpretation of hand gestures for human–computer interaction: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence (1997)
  • T.E. Starner, Unencumbered virtual environments, PhD Thesis, MIT Media Arts and Sciences Section, USA,...
  • J.S. Kim et al., A dynamic gesture recognition system for the Korean Sign Language (KSL), IEEE Transactions on Systems, Man, and Cybernetics (1996)
  • J. Lee et al., Model-based analysis of hand posture, IEEE Computer Graphics and Applications (1995)
