
Pattern Recognition

Volume 106, October 2020, 107462

A CNN-based 3D human pose estimation based on projection of depth and ridge data

https://doi.org/10.1016/j.patcog.2020.107462

Highlights

  • We propose a CNN-based human pose estimation method using depth and ridge data.

  • We project the depth and ridge data onto three orthogonal planes (XY, XZ, ZY).

  • Projecting the depth and ridge data reduces the loss of 3D information.

  • Ridge data are introduced to avoid joint drift, which improves the accuracy of estimated poses.

  • The proposed method achieves state-of-the-art pose estimation accuracy.

Abstract

We propose a method that uses a convolutional neural network (CNN) to estimate human pose by analyzing projections of depth and ridge data, where ridge data represent the local maxima in a distance transform map. To fully utilize the 3D information of the depth points, we project the depth and ridge data in multiple directions. The proposed projection reduces 3D information loss, the ridge data avoid joint drift, and the CNN increases localization accuracy. The proposed method proceeds as follows. (1) We use depth data to segment the human from the background and extract ridge data from the human silhouette. (2) We project the depth and ridge data onto the XY, XZ, and ZY planes. (3) ResNet-101 accepts the six projected images and uses 1 × 1 convolution layers to generate 2D heatmaps and offsets. (4) We generate 2D keypoints per plane by using the soft-argmax operation. (5) We obtain 3D joint positions by using fully-connected layers. In experiments on the SMMC-10, EVAL, and ITOP datasets, the proposed method achieved state-of-the-art pose estimation accuracy. The proposed method reduces the 3D information loss and the drift of joint positions that can occur during human pose estimation.

Introduction

Human pose estimation is the task of finding human model parameters, such as the lengths and orientations of body parts (e.g., head, torso, limbs), that fit an input image [1]. High-speed depth imaging devices permit the extraction of rich information from depth silhouettes, which simplifies human pose estimation.

There are two approaches to identifying human poses from depth silhouettes: discriminative and generative. The discriminative approach finds human joints in an observed input image using pre-trained body-part detectors. Kong et al. [2] proposed a geodesic-feature-based joint detector to localize body joints in depth data. Jain et al. [3] proposed a head-torso detector based on Haar candidates and a template-matching algorithm for each limb, but it required that the upper body and face be visible without any occlusion. Shotton et al. [4] predicted an intermediate body-part representation to estimate the human pose, but the prediction usually required expensive training steps and a large number of training samples to cover the wide human pose space. Girshick et al. [5] used a regression forest to localize the joint positions directly from per-pixel votes, but it required increasingly complex training because the voting was modeled on the result of [4]. Wang et al. [6] used semi-local features extracted from randomly-sampled 4D sub-volumes of depth sequences, which required highly complicated training and long processing time. Buys et al. [7] described a method to transform the depth data into an intermediate representation without background subtraction. Most of these discriminative approaches do not miss body parts entirely, but they suffer from occlusion because they can detect only visible parts, which seriously degrades human pose estimation accuracy.

The generative approach finds human joints by fitting a pre-defined human body model to an observed input image. Grest et al. [8] and Knoop et al. [9] proposed iterative closest point (ICP) methods to estimate human poses and to track human body parts, but these methods were computationally complicated, so they were not applicable in real-time systems. Rosenhahn et al. [10] presented a marker-less motion capture system that took the lower-dimensional human pose manifold into account by using soft constraints to model motion restrictions during pose optimization, but it could not be used for challenging outdoor scenes that included shadows and strong illumination changes. Straka et al. [11] addressed occlusion by using graph-based inference, and Ye et al. [12] did so by using joint energy minimization, but both have limited applicability because they require multiple cameras. Zhang et al. [13] used a parameterized human shape model that estimated the human pose from all available data retrieved from several cameras; although it improved pose estimation, it still could not differentiate occluded body parts. Ganapathi et al. [14] proposed an efficient filtering algorithm for tracking human pose that combined an accurate generative model with a discriminative model that provided data-driven evidence about body-part locations. Most of these generative approaches can solve the occlusion problem to some degree by exploiting prior knowledge of the human body model, and they require no training steps, but they can fail to track body parts entirely and take a long computation time because they iteratively fit a complicated human model to the observed input image.

Recently, convolutional neural networks (CNNs) have been successfully applied to the task of 3D human pose estimation. Toshev and Szegedy [15] proposed to directly regress the 2D Cartesian coordinates of joints in a holistic manner. Cao et al. [16] used heatmaps, which are intermediate representations of each joint, to refine the joint positions in 2D pose estimation.

This paper proposes a method to identify human joints reliably. We propose the use of ridge data, which constitute a novel representation of the human body. The ridge data are more plentiful and more scale-invariant than the output of existing skeletonization techniques [17], [18]. Using these data, we achieve higher pose estimation accuracy (0.9868, 0.9835, and 0.9689 mAP) than the current state-of-the-art methods on the SMMC-10, EVAL, and ITOP datasets.

The rest of this paper is organized as follows. Section 2 describes the proposed human depth silhouette segmentation. Section 3 describes the motivation, definition, and extraction of ridge data, which constitute a novel feature for human pose estimation. Section 4 describes the projection of the depth and ridge data. Section 5 describes the proposed CNN-based human pose estimation method. Section 6 validates the proposed feature and methods through experiments on the SMMC-10, EVAL, and ITOP datasets. Section 7 presents conclusions.

Section snippets

Human segmentation

Humans are segmented from a depth image in four steps: floor removal, object segmentation, human detection, and human identification (Fig. 1).
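
To make these steps concrete, the following is a minimal Python sketch of the four-step pipeline, assuming a pinhole camera with known intrinsics and a pre-estimated floor height; the thresholds and the blob-size heuristic are illustrative assumptions, not the detectors used in the paper.

```python
# A minimal sketch of the four-step segmentation pipeline (assumptions:
# pinhole camera, known floor height, simple size heuristics).
import numpy as np
from scipy import ndimage

def segment_human(depth, floor_y, fy=525.0, cy=240.0):
    """depth: HxW depth map in meters; floor_y: estimated floor height (m)."""
    h = depth.shape[0]
    v = np.arange(h)[:, None] - cy      # pixel rows relative to the center
    y_world = -(v / fy) * depth         # back-projected height (sign assumed)

    # 1) Floor removal: drop pixels within 5 cm of the floor plane.
    mask = (depth > 0) & (y_world > floor_y + 0.05)

    # 2) Object segmentation: split the remaining pixels into connected blobs.
    labels, n = ndimage.label(mask)

    # 3) Human detection: keep blobs whose pixel count is plausibly human-sized.
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    candidates = [i + 1 for i, s in enumerate(sizes) if s > 3000]

    # 4) Human identification: pick the largest candidate as the subject.
    if not candidates:
        return np.zeros_like(mask)
    best = max(candidates, key=lambda i: sizes[i - 1])
    return labels == best
```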

Ridge data

The motivation for introducing the concept of ridge data is to solve the common problem that estimated joint positions drift. When we use raw depth data directly in human pose estimation, the estimation errors of overlapping body parts are likely to increase because the raw depth data of the overlapped parts may become conflated. For example, when a forearm overlaps the torso, raw depth data from the torso can be included among the joint candidates for the hand position. …
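
As a concrete illustration of this definition, the sketch below extracts ridge points as the local maxima of the distance transform of a binary human silhouette; the 5 × 5 neighborhood and the minimum ridge width are assumed parameters, not values from the paper.

```python
# A minimal sketch of ridge-data extraction: local maxima of the distance
# transform of the human silhouette (neighborhood size and width threshold
# are illustrative assumptions).
from scipy import ndimage

def extract_ridge(silhouette, min_width=2.0):
    """silhouette: HxW boolean mask of the segmented human."""
    # Distance from each foreground pixel to the nearest background pixel.
    dist = ndimage.distance_transform_edt(silhouette)

    # A pixel is a ridge point if it equals the maximum of its neighborhood
    # and is far enough from the silhouette boundary.
    local_max = ndimage.maximum_filter(dist, size=5)
    return (dist == local_max) & (dist >= min_width)
```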

Projection of depth and ridge data

Because the direct mapping from the input image to the 3D positions of joints is highly non-linear and difficult to learn, the recent trend in human pose estimation is to map the input image to a set of heatmaps that represent the probability distributions of joint positions. However, a heatmap provides only the 2D information of a joint position [22], so little depth information is utilized.

We consider a CNN (Fig. 6) that uses the projected images, as in [23]. It fully utilizes the 3D information …
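
To illustrate the projection step, the sketch below back-projects the masked depth pixels into camera space and rasterizes the resulting points onto the XY, XZ, and ZY planes; the intrinsics and the 64 × 64 output resolution are assumed values, not the paper's settings.

```python
# A minimal sketch of projecting depth points onto three orthogonal planes
# (assumed pinhole intrinsics and output resolution).
import numpy as np

def project_three_planes(depth, mask, fx=525.0, fy=525.0,
                         cx=320.0, cy=240.0, res=64):
    """Return XY, XZ, and ZY occupancy images of the masked depth points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - cx) / fx * z               # back-project to camera coordinates
    y = (v - cy) / fy * z

    def ortho(a, b):
        # Drop one axis and rasterize the other two into a res x res grid.
        img, _, _ = np.histogram2d(a, b, bins=res)
        return np.clip(img, 0.0, 1.0)   # binary occupancy image

    return ortho(x, y), ortho(x, z), ortho(z, y)
```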

Feature extraction and 3D joint prediction

One traditional approach to predicting keypoints is to use a Gaussian-shaped heatmap. The major limitation of this approach is that as the network gets deeper, the heatmap gets smaller due to the use of pooling layers. Moreover, the heatmap-based approach tends to lose the structural information of the human body. To overcome these problems, we divide the original regression problem into a classification problem and a regression problem. …
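
For the soft-argmax operation used to produce the 2D keypoints, a minimal NumPy sketch follows: a softmax over the heatmap scores followed by the expectation of the pixel coordinates, which yields differentiable, sub-pixel keypoints (the temperature beta is an assumed hyperparameter).

```python
# A minimal sketch of soft-argmax: softmax over the heatmap, then the
# expected pixel coordinates (beta is an assumed temperature).
import numpy as np

def soft_argmax(heatmap, beta=100.0):
    """heatmap: HxW score map for one joint; returns (x, y) in pixels."""
    h, w = heatmap.shape
    flat = heatmap.reshape(-1) * beta
    prob = np.exp(flat - flat.max())    # numerically stable softmax
    prob = (prob / prob.sum()).reshape(h, w)

    ys, xs = np.mgrid[0:h, 0:w]
    return (prob * xs).sum(), (prob * ys).sum()  # expected (x, y)
```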

Implementation details

The proposed model was trained using 1 million synthetic depth images, which were generated using a 3D modeling tool from 15 base models with 36,000 uniformly-distributed poses. To ensure that the synthetic dataset is sufficiently realistic and covers a wide variety of human body shapes, we randomly sampled a set of parameters, such as the type of base model and the length variation of each body part. Then we rendered the depth images according to the predefined camera parameters, such as position …
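
The sampling procedure can be sketched as follows; the parameter ranges and field names are hypothetical, since the paper states only the counts (15 base models, 36,000 poses) and that the camera parameters were predefined.

```python
# An illustrative sketch of random parameter sampling for synthetic depth
# data; the ranges and the limb count are hypothetical assumptions.
import random

BASE_MODELS = list(range(15))        # 15 base body models
NUM_POSES = 36_000                   # uniformly-distributed poses

def sample_example():
    return {
        "base_model": random.choice(BASE_MODELS),
        "pose_id": random.randrange(NUM_POSES),
        # per-body-part length variation, e.g. +/-10% (assumed range)
        "limb_scale": [random.uniform(0.9, 1.1) for _ in range(10)],
    }
```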

Conclusions

We proposed a novel representation of the human body using ridge data, which are defined as the local maxima of the distance transform map and form a selective representation of the skeleton. The proposed ridge data make human pose estimation more robust because they can be extracted even under occlusion, full-body rotation, and fast movement.

We proposed a projection method that successfully utilizes the 3D information from the depth image. The proposed projection method can also …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was partially supported by the MSIT (Ministry of Science and ICT), Korea, under the SW Starlab support program (IITP-2017-0-00897) supervised by the IITP (Institute for Information & Communications Technology Promotion). This research was partially supported by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (IITP-2018-0-01290, Development of Open Informal Dataset and Dynamic Object Recognition Technology Affecting …)


References

  • R. Girshick et al., Efficient regression of general-activity human poses from depth images, in: Proceedings of the International Conference on Computer Vision, 2011.

  • J. Wang et al., Robust 3D action recognition with random occupancy patterns, in: Proceedings of the European Conference on Computer Vision, 2012.

  • D. Grest et al., Nonlinear body pose estimation from depth images, in: Joint Pattern Recognition Symposium, 2005.

  • S. Knoop et al., Sensor fusion for 3D human body tracking with an articulated 3D body model, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2006.

  • B. Rosenhahn et al., Markerless motion capture of man-machine interaction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

  • M. Straka et al., Skeletal graph based human pose estimation in real-time, in: Proceedings of the British Machine Vision Conference, 2011.

  • G. Ye et al., Performance capture of interacting characters with handheld Kinects, in: Proceedings of the European Conference on Computer Vision, 2012.

  • L. Zhang et al., Real-time human motion tracking using multiple depth cameras, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
Yeonho Kim received the B.S. degree in computer engineering from the Catholic University of Korea, Seoul, Korea, in 2008, and the Ph.D. degree in computer science and engineering from Pohang University of Science and Technology (POSTECH), Pohang, Korea, in 2019. He is currently a senior researcher at Samsung S1. His research interests include computer vision, human-computer interaction, and human pose estimation.

Daijin Kim received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1981, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1984. In 1991, he received the Ph.D. degree in electrical and computer engineering from Syracuse University, Syracuse, NY. During 1992–1999, he was an Associate Professor in the Department of Computer Engineering at DongA University, Pusan, Korea. He is currently a Professor in the Department of Computer Science and Engineering at POSTECH, Pohang, Korea, Director of the BK21+ POSTECH CSE Institute, and Director of the Software Start Lab (ADAS Computer Vision). His research interests include computer vision, pattern recognition, and machine learning.
