
Pattern Recognition

Volume 106, October 2020, 107462

A CNN-based 3D human pose estimation based on projection of depth and ridge data

https://doi.org/10.1016/j.patcog.2020.107462

Highlights

  • We propose a CNN-based human pose estimation method using depth and ridge data.

  • We project the depth and ridge data onto three orthogonal planes (XY, XZ, ZY).

  • Projecting the depth and ridge data reduces the loss of 3D information.

  • Ridge data are introduced to avoid joint drift, which improves the accuracy of estimated poses.

  • The proposed method achieves state-of-the-art pose estimation accuracy.

Abstract

We propose a method that uses a convolutional neural network (CNN) to estimate human pose by analyzing projections of depth and ridge data, where ridge data represent the local maxima in a distance transform map. To fully utilize the 3D information of the depth points, we project the depth and ridge data in multiple directions. The proposed projection reduces 3D information loss, the ridge data avoid joint drift, and the CNN increases localization accuracy. The proposed method proceeds as follows. (1) We use depth data to segment the human from the background and extract ridge data from the human silhouette. (2) We project the depth and ridge data onto the XY, XZ, and ZY planes. (3) ResNet-101 accepts the six projected images and uses 1 × 1 convolution layers to generate 2D heatmaps and offsets. (4) We generate 2D keypoints per plane by using the soft-argmax operation. (5) We obtain 3D joint positions by using fully-connected layers. In experiments on the SMMC-10, EVAL, and ITOP datasets, the proposed method achieved state-of-the-art pose estimation accuracy. The proposed method reduces the 3D information loss and the drift of joint positions that can occur during human pose estimation.

Introduction

Human pose estimation is the task of finding human model parameters, such as the lengths and orientations of body parts (e.g., head, torso, limbs), that fit an input image [1]. High-speed depth imaging devices permit the extraction of rich information from depth silhouettes, which simplifies human pose estimation.

There are two approaches to identifying human poses from depth silhouettes: discriminative and generative. The discriminative approach finds human joints in an observed input image using pre-trained body-part detectors. Kong et al. [2] proposed a geodesic-feature-based joint detector to localize body joints in depth data. Jain et al. [3] proposed a head-torso detector based on Haar candidates and a template-matching algorithm for each limb, but it required that the upper body and face be visible without any occlusion. Shotton et al. [4] predicted an intermediate body-part representation to estimate the human pose, but the prediction usually required expensive training steps and a large number of training samples to cover the wide human pose space. Girshick et al. [5] used a regression forest to localize the joint positions directly from per-pixel votes, but it required increasingly complex training because the voting was modeled on the result of [4]. Wang et al. [6] used semi-local features extracted from randomly-sampled 4D sub-volumes of depth sequences, which required highly complicated training and long processing time. Buys et al. [7] described a method to transform the depth data into an intermediate representation without background subtraction. Most of these discriminative approaches do not miss body parts entirely, but they suffer from occlusion because they can detect only visible parts, which seriously degrades human pose estimation accuracy.

The generative approach finds human joints by fitting a pre-defined human body model to an observed input image. Grest et al. [8] and Knoop et al. [9] proposed iterative closest point (ICP) methods to estimate human poses and to track human body parts, but these methods were computationally complicated, so they were not applicable in real-time systems. Rosenhahn et al. [10] presented a marker-less motion capture system that took the lower-dimensional human pose manifold into account by using soft constraints to model motion restrictions during pose optimization, but it could not be used for challenging outdoor scenes that included shadows and strong illumination changes. Straka et al. [11] addressed occlusion by using graph-based inference, and Ye et al. [12] did so by using joint energy minimization, but both have limited applicability because they require multiple cameras. Zhang et al. [13] used a parameterized human shape model that estimated the human pose from all available data retrieved from several cameras; although it improved pose estimation, it still could not differentiate occluded body parts. Ganapathi et al. [14] proposed an efficient filtering algorithm for tracking human pose that combined an accurate generative model with a discriminative model that provided data-driven evidence about body-part locations. Most of these generative approaches can solve the occlusion problem to some degree by exploiting prior knowledge of the human body model, and they require no training steps, but they can fail to track body parts entirely and take a long computation time because they iteratively fit a complicated human model to the observed input image.

Recently, convolutional neural networks (CNNs) have been successfully applied to the task of 3D human pose estimation. Toshev and Szegedy [15] proposed to directly regress the 2D Cartesian coordinates of joints in a holistic manner. Cao et al. [16] used heatmaps, which are intermediate representations of each joint, to refine the joint positions in 2D pose estimation.

This paper proposes a method to identify human joints reliably. We propose the use of ridge data, which constitute a novel representation of the human body. The ridge data are more plentiful and more scale-invariant than the output of existing skeletonization techniques [17], [18]. Using these data, we achieve higher pose estimation accuracy (0.9868, 0.9835, and 0.9689 mAP) than the current state-of-the-art methods on the SMMC-10, EVAL, and ITOP datasets.

The rest of this paper is organized as follows. Section 2 describes the proposed human depth silhouette segmentation. Section 3 describes the motivation, definition, and extraction of ridge data, which constitute a novel feature for human pose estimation. Section 4 describes the projection of the depth and ridge data. Section 5 describes the proposed CNN-based human pose estimation method. Section 6 validates the proposed feature and methods through experiments on the SMMC-10, EVAL, and ITOP datasets. Section 7 presents conclusions.

Section snippets

Human segmentation

Humans are segmented from a depth image in four steps: floor removal, object segmentation, human detection, and human identification (Fig. 1).
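
To make these steps concrete, the following is a minimal Python sketch of the four-step pipeline, assuming a pinhole camera with known intrinsics and a pre-estimated floor height; the thresholds and the blob-size heuristic are illustrative assumptions, not the detectors used in the paper.

```python
# A minimal sketch of the four-step segmentation pipeline (assumptions:
# pinhole camera, known floor height, simple size heuristics).
import numpy as np
from scipy import ndimage

def segment_human(depth, floor_y, fy=525.0, cy=240.0):
    """depth: HxW depth map in meters; floor_y: estimated floor height (m)."""
    h = depth.shape[0]
    v = np.arange(h)[:, None] - cy      # pixel rows relative to the center
    y_world = -(v / fy) * depth         # back-projected height (sign assumed)

    # 1) Floor removal: drop pixels within 5 cm of the floor plane.
    mask = (depth > 0) & (y_world > floor_y + 0.05)

    # 2) Object segmentation: split the remaining pixels into connected blobs.
    labels, n = ndimage.label(mask)

    # 3) Human detection: keep blobs whose pixel count is plausibly human-sized.
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    candidates = [i + 1 for i, s in enumerate(sizes) if s > 3000]

    # 4) Human identification: pick the largest candidate as the subject.
    if not candidates:
        return np.zeros_like(mask)
    best = max(candidates, key=lambda i: sizes[i - 1])
    return labels == best
```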

Ridge data

The motivation for introducing the concept of ridge data is to solve the common problem that estimated joint positions drift. When we use raw depth data directly in human pose estimation, the estimation errors of overlapping body parts are likely to increase because the raw depth data of the overlapped parts may become conflated. For example, when a forearm overlaps the torso, raw depth data from the torso can be included among the joint candidates for the hand position. …
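
As a concrete illustration of this definition, the sketch below extracts ridge points as the local maxima of the distance transform of a binary human silhouette; the 5 × 5 neighborhood and the minimum ridge width are assumed parameters, not values from the paper.

```python
# A minimal sketch of ridge-data extraction: local maxima of the distance
# transform of the human silhouette (neighborhood size and width threshold
# are illustrative assumptions).
from scipy import ndimage

def extract_ridge(silhouette, min_width=2.0):
    """silhouette: HxW boolean mask of the segmented human."""
    # Distance from each foreground pixel to the nearest background pixel.
    dist = ndimage.distance_transform_edt(silhouette)

    # A pixel is a ridge point if it equals the maximum of its neighborhood
    # and is far enough from the silhouette boundary.
    local_max = ndimage.maximum_filter(dist, size=5)
    return (dist == local_max) & (dist >= min_width)
```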

Projection of depth and ridge data

Because the direct mapping from the input image to the 3D positions of joints is highly non-linear and difficult to learn, the recent trend in human pose estimation is to map the input image to a set of heatmaps that represent the probability distributions of joint positions. However, a heatmap provides only the 2D information of a joint position [22], so little depth information is utilized.

We consider a CNN (Fig. 6) that uses the projected images, as in [23]. It fully utilizes the 3D information …
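
To illustrate the projection step, the sketch below back-projects the masked depth pixels into camera space and rasterizes the resulting points onto the XY, XZ, and ZY planes; the intrinsics and the 64 × 64 output resolution are assumed values, not the paper's settings.

```python
# A minimal sketch of projecting depth points onto three orthogonal planes
# (assumed pinhole intrinsics and output resolution).
import numpy as np

def project_three_planes(depth, mask, fx=525.0, fy=525.0,
                         cx=320.0, cy=240.0, res=64):
    """Return XY, XZ, and ZY occupancy images of the masked depth points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - cx) / fx * z               # back-project to camera coordinates
    y = (v - cy) / fy * z

    def ortho(a, b):
        # Drop one axis and rasterize the other two into a res x res grid.
        img, _, _ = np.histogram2d(a, b, bins=res)
        return np.clip(img, 0.0, 1.0)   # binary occupancy image

    return ortho(x, y), ortho(x, z), ortho(z, y)
```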

Feature extraction and 3D joint prediction

One traditional approach to predicting keypoints is to use a Gaussian-shaped heatmap. The major limitation of this approach is that as the network gets deeper, the heatmap gets smaller due to the use of pooling layers. Moreover, the heatmap-based approach tends to lose the structural information of the human body. To overcome these problems, we divide the original regression problem into a classification problem and a regression problem. …
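
For the soft-argmax operation used to produce the 2D keypoints, a minimal NumPy sketch follows: a softmax over the heatmap scores followed by the expectation of the pixel coordinates, which yields differentiable, sub-pixel keypoints (the temperature beta is an assumed hyperparameter).

```python
# A minimal sketch of soft-argmax: softmax over the heatmap, then the
# expected pixel coordinates (beta is an assumed temperature).
import numpy as np

def soft_argmax(heatmap, beta=100.0):
    """heatmap: HxW score map for one joint; returns (x, y) in pixels."""
    h, w = heatmap.shape
    flat = heatmap.reshape(-1) * beta
    prob = np.exp(flat - flat.max())    # numerically stable softmax
    prob = (prob / prob.sum()).reshape(h, w)

    ys, xs = np.mgrid[0:h, 0:w]
    return (prob * xs).sum(), (prob * ys).sum()  # expected (x, y)
```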

Implementation details

The proposed model was trained using 1 million synthetic depth images, which were generated using a 3D modeling tool from 15 base models with 36,000 uniformly-distributed poses. To ensure that the synthetic dataset is sufficiently realistic and covers a wide variety of human body shapes, we randomly sampled a set of parameters, such as the type of base model and the length variation of each body part. Then we rendered the depth images according to the predefined camera parameters, such as position …
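
The sampling procedure can be sketched as follows; the parameter ranges and field names are hypothetical, since the paper states only the counts (15 base models, 36,000 poses) and that the camera parameters were predefined.

```python
# An illustrative sketch of random parameter sampling for synthetic depth
# data; the ranges and the limb count are hypothetical assumptions.
import random

BASE_MODELS = list(range(15))        # 15 base body models
NUM_POSES = 36_000                   # uniformly-distributed poses

def sample_example():
    return {
        "base_model": random.choice(BASE_MODELS),
        "pose_id": random.randrange(NUM_POSES),
        # per-body-part length variation, e.g. +/-10% (assumed range)
        "limb_scale": [random.uniform(0.9, 1.1) for _ in range(10)],
    }
```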

Conclusions

We proposed a novel representation of the human body using ridge data, which are defined as the local maxima of the distance transform map and form a selective representation of the skeleton. The proposed ridge data make human pose estimation more robust because they can be extracted even under occlusion, full-body rotation, and fast movement.

We proposed a projection method that successfully utilizes the 3D information from the depth image. The proposed projection method can also …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was partially supported by the MSIT (Ministry of Science and ICT), Korea, under the SW Starlab support program (IITP-2017-0-00897) supervised by the IITP (Institute for Information & Communications Technology Promotion). This research was partially supported by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (IITP-2018-0-01290, Development of Open Informal Dataset and Dynamic Object Recognition Technology Affecting …)


References

  • R. Girshick et al., Efficient regression of general-activity human poses from depth images, in: Proceedings of the International Conference on Computer Vision, 2011.

  • J. Wang et al., Robust 3D action recognition with random occupancy patterns, in: Proceedings of the European Conference on Computer Vision, 2012.

  • D. Grest et al., Nonlinear body pose estimation from depth images, in: Joint Pattern Recognition Symposium, 2005.

  • S. Knoop et al., Sensor fusion for 3D human body tracking with an articulated 3D body model, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2006.

  • B. Rosenhahn et al., Markerless motion capture of man-machine interaction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

  • M. Straka et al., Skeletal graph based human pose estimation in real-time, in: Proceedings of the British Machine Vision Conference, 2011.

  • G. Ye et al., Performance capture of interacting characters with handheld Kinects, in: Proceedings of the European Conference on Computer Vision, 2012.

  • L. Zhang et al., Real-time human motion tracking using multiple depth cameras, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
Yeonho Kim received the B.S. degree in computer engineering from the Catholic University of Korea, Seoul, Korea, in 2008, and the Ph.D. degree in computer science and engineering from Pohang University of Science and Technology (POSTECH), Pohang, Korea, in 2019. He is currently a senior researcher at Samsung S1. His research interests include computer vision, human-computer interaction, and human pose estimation.

Daijin Kim received the B.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1981, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1984. In 1991, he received the Ph.D. degree in electrical and computer engineering from Syracuse University, Syracuse, NY. During 1992–1999, he was an Associate Professor in the Department of Computer Engineering at DongA University, Pusan, Korea. He is currently a Professor in the Department of Computer Science and Engineering at POSTECH, Pohang, Korea, Director of the BK21+ POSTECH CSE Institute, and Director of the Software Start Lab (ADAS Computer Vision). His research interests include computer vision, pattern recognition, and machine learning.
