Elsevier

Expert Systems with Applications

Volume 41, Issue 14, 15 October 2014, Pages 6305-6314

Recovering 3D human pose based on biomechanical constraints, postures comfort and image shading

https://doi.org/10.1016/j.eswa.2014.03.049

Highlights

  • We introduce a set of biomechanical constraints to reduce the number of postures.

  • We verify that in 341 images the correct posture is ranked in the first 10 positions.

  • For 92% of the images, the model's ranking included the correct answer.

Abstract

This paper presents a new model to identify 3D human poses in pictures, given a single input image. The proposed approach builds on a well-known model from the literature, adding biomechanical restrictions that reduce the number of possible 3D postures that correctly represent the pose in the 2D image. Since the generated set can still contain more than one possible posture, we propose a ranking system that suggests the best generated postures according to a “comfort” criterion and to shading characteristics of the image. The comfort criterion adopts assumptions in terms of pose equilibrium, while the shading criterion eliminates ambiguities among postures by taking the image illumination into account. We emphasize that the removal of ambiguous 3D poses associated with a single image is the main focus of this work. The achieved results were analyzed through visual inspection by users as well as against a state-of-the-art technique, and indicate that our model contributes to the solution of this challenging problem.

Introduction

The way a person poses in front of a camera can indicate his/her emotions, attitudes and intentions. Recovering the 3D human pose from video streams or images can be useful in areas such as sport performance analysis, automatic search in image databases, avatar reconstruction, and person identification systems, among other applications. Indeed, many applications can benefit from technology that deals with data coming from single images or videos. However, despite the large range of applications, this is still an open research field due to the variability of possible human movements, partial occlusions and the restrictions imposed by the loss of information when the 3D world is mapped into a single 2D image. Considerable effort has been devoted to solving such problems. Several papers surveying the state of the art in the area are available in the literature, providing good summaries of the models currently being developed (Agarwal and Triggs, 2006, Moeslund et al., 2006). The recovery of human poses can deal with information from a single image, where there is no depth information, or from video streams, which add motion and time information to the process. The output of pose recovery algorithms can also vary in nature, being a pose in two or three dimensions, depending on the needs of the solution.

An important problem in 3D human pose recovery is ambiguity. Such ambiguity arises when estimating three-dimensional positions from a 2D image, since many 3D postures can present the same 2D projection, and it is inherent to the loss of depth information in 2D images (Hua, Yang, & Wu, 2005). Many authors (Jiang, 2010, Pishchulin et al., 2012, Wei and Jinxiang, 2009, Fergie and Galata, 2013) mention this as one of the main problems in obtaining the 3D human posture. In the study developed by Agarwal and Triggs (2004), the authors define it as an intrinsic challenge of estimating 3D poses. Ambiguity has been dealt with in several ways. For instance, Wei and Jinxiang (2009) propose a method that uses a set of biomechanical restrictions on the angles of the body joints to eliminate ambiguity. In the study by Lee and Cohen (2006), the ambiguity is handled through an approach that employs Markov Chains. Moeslund et al. (2006) use techniques based on kinematic and movement constraints to treat ambiguity based on motion capture. Moreover, in the state of the art we find methods that focus on a particular controlled situation, usually requiring databases containing posture samples. This is the case of the model proposed by Mori, Ren, Efros, and Malik (2004), who developed an approach for obtaining poses of baseball players.
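To make the ambiguity concrete, the short Python sketch below (our own illustration, not code from the paper, and assuming a simple scaled-orthographic camera) shows that flipping the sign of a joint's relative depth leaves its 2D projection unchanged, so the image alone cannot distinguish the two postures. The joint coordinates are arbitrary.

    import numpy as np

    def project_orthographic(joint_3d, scale=1.0):
        # Scaled-orthographic projection: the depth coordinate is simply dropped.
        x, y, _z = joint_3d
        return np.array([scale * x, scale * y])

    # Two candidate 3D positions for the same joint: bent towards or away from the camera.
    elbow_towards_camera = np.array([0.3, 0.5, +0.2])    # positive relative depth
    elbow_away_from_camera = np.array([0.3, 0.5, -0.2])  # negative relative depth

    same_projection = np.allclose(project_orthographic(elbow_towards_camera),
                                  project_orthographic(elbow_away_from_camera))
    print(same_projection)  # True: both 3D postures map to the same 2D image point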

In this paper, we propose a new model to handle the ambiguity problem, initially using a set of biomechanical restrictions to obtain a set of possible 3D poses from 2D images. The resulting set is then ranked based on a comfort criterion that encodes assumptions in terms of pose equilibrium. In addition, a luminosity criterion that considers the lighting of the original image is used to improve the ranking of the set of postures. It is important to emphasize that our paper does not target one specific application, e.g. baseball players (Mori et al., 2004); instead, our scope is focused on frontal poses in images without perspective deformation. The main contributions are the use of the biomechanical model, which reduces the number of possible 3D poses by discarding impossible postures, and the ranking of the most appropriate poses based on posture comfort and luminosity of the scene. While the biomechanical model and the posture comfort deal with joint positions, the luminosity of the scene considers the pixels of the original processed image. The experimental results show that our model breaks new ground in the area when compared to a competing approach.
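As a rough sketch of how such a pipeline can be organized (our own illustration under simplified assumptions: the paper's actual constraint set, comfort criterion and shading criterion are defined in the full text, and the fields angles, com_offset and predicted_shading used here are hypothetical placeholders), candidate poses are first filtered by biomechanical limits and the survivors are then sorted by a weighted combination of comfort and shading scores:

    def is_biomechanically_valid(pose, limits):
        # Keep only poses whose joint angles fall inside anatomical limits.
        return all(lo <= a <= hi for a, (lo, hi) in zip(pose["angles"], limits))

    def comfort_score(pose):
        # Placeholder comfort criterion: penalize the offset of the body's
        # center of mass from its support (a stand-in for pose equilibrium).
        return -abs(pose["com_offset"])

    def shading_score(pose, observed_shading):
        # Placeholder shading criterion: penalize disagreement between the
        # shading predicted for the candidate pose and the observed image shading.
        return -abs(pose["predicted_shading"] - observed_shading)

    def rank_poses(candidates, limits, observed_shading, w_comfort=0.5, w_shading=0.5):
        valid = [p for p in candidates if is_biomechanically_valid(p, limits)]
        return sorted(valid,
                      key=lambda p: w_comfort * comfort_score(p)
                                    + w_shading * shading_score(p, observed_shading),
                      reverse=True)

    # Toy usage: two candidate postures for the same 2D image; the second is
    # discarded because its joint angle violates the biomechanical limit.
    limits = [(0.0, 150.0)]
    candidates = [{"angles": [30.0], "com_offset": 0.1, "predicted_shading": 0.8},
                  {"angles": [170.0], "com_offset": 0.4, "predicted_shading": 0.2}]
    ranked = rank_poses(candidates, limits, observed_shading=0.7)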

This paper is organized as follows: Section 2 presents an overview of the main techniques currently developed. In Section 3 we present our model to recover 3D human poses from single images. In Section 4, experimental results are discussed. Finally, Section 5 discusses the limitations and conclusions of the proposed model.

Section snippets

Related work

Although many efforts have been made to obtain 3D human poses, there is still no well-defined taxonomy in the literature for dealing with this problem. Hu, Wang, Lin, and Yan (2009) suggest two main categories: models that obtain the human posture from video sequences (Agarwal and Triggs, 2006, Chen et al., 2011, Lee and Nevatia, 2007, Menier et al., 2006) and models that recover the pose from static images such as photographs, which is the case of the present work. The models based on static

Proposed model

The problem of estimating the 3D pose of a person in videos has received special attention in the computer vision literature, as discussed in Section 2. This is partly due to the fact that solutions to this problem can be employed in a wide range of applications. However, less attention has been given to the problem of determining the 3D human pose based on a single image. In fact, this problem is a challenge because the restrictions of 2D images are often not sufficient to determine poses
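The snippet above is truncated on this preview page. As a rough sketch of how a candidate set of 3D poses can be enumerated when, as suggested by the Results section, each joint's relative depth ΔZ is known up to its sign (our own illustration, not the paper's implementation; the dummy constraint stands in for the real biomechanical limits), all sign assignments are generated and then pruned:

    from itertools import product

    def generate_candidates(delta_z_magnitudes):
        # Enumerate every combination of depth signs: 2^N candidates for N joints.
        n = len(delta_z_magnitudes)
        for signs in product((+1, -1), repeat=n):
            yield [s * m for s, m in zip(signs, delta_z_magnitudes)]

    def prune(candidates, is_valid):
        # Discard candidates that violate the biomechanical constraints.
        return [c for c in candidates if is_valid(c)]

    # Toy example with three joints and a dummy constraint.
    magnitudes = [0.10, 0.25, 0.05]
    valid = prune(generate_candidates(magnitudes), lambda c: c[0] >= 0)
    print(len(valid))  # 4 of the 8 sign combinations survive this toy constraint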

Results

In this section we describe the experiments performed in order to validate the presented model. Initially, we selected 430 images containing people in various postures. These images were obtained from databases available on the Internet following the work of Bourdev and Malik (2009), Dalal and Triggs (2005) and Ferrari et al. (2008). For each image, the ground truth of the human pose in the image was generated (defined through the sign of ΔZ of each joint i). The ground truth process
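A minimal sketch of the rank-based evaluation implied by the highlights above (our own, for illustration only, and not necessarily the authors' exact protocol): for each image, check whether the ground-truth ΔZ sign pattern appears among the first k ranked candidates, then report the fraction of images for which it does.

    def top_k_hit_rate(ranked_candidates_per_image, ground_truth_per_image, k=10):
        # ranked_candidates_per_image[i] is the ranked candidate list for image i;
        # ground_truth_per_image[i] is that image's ground-truth sign pattern.
        hits = sum(1 for ranked, gt in zip(ranked_candidates_per_image,
                                           ground_truth_per_image)
                   if gt in ranked[:k])
        return hits / len(ground_truth_per_image)

    # For example, 341 hits out of 430 images corresponds to a top-10 rate of about 0.79.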

Conclusions

This paper described a model for the recovery of 3D poses from a single 2D image. We verified, through a literature investigation, that some challenges in this area are still open and there is no definitive solution to the problem. Characteristics such as perspective, lighting, noise, partial occlusions, different clothes and the ambiguity among 3D poses are examples of challenges that still need to be solved. In particular, the approach proposed in this paper aimed to minimize the ambiguous 3D postures

References (31)

  • Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).
  • Bourdev, L., & Malik, J. (2009). Poselets: Body part detectors trained using 3D human pose annotations. In IEEE 12th international conference on computer vision (ICCV).
  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (CVPR).
  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR).
  • Hornung, A., Deckers, E., & Kobbelt, L. (2007). Character animation from 2D pictures and 3D motion data. In ACM Transactions on Graphics.
