Neurocomputing

Volume 164, 21 September 2015, Pages 210-219
Depth-images-based pose estimation using regression forests and graphical models

https://doi.org/10.1016/j.neucom.2015.02.068

Abstract

Depth-images-based human pose estimation faces two challenges: how to extract features that are discriminative with respect to variations in human pose and robust against noise, and how to reliably learn body joints based on their dependence structure. To tackle the first problem, we propose a novel 3D Local Shape Context feature, extracted from the human body silhouette, to characterise the local structure of body joints. To tackle the second problem, we incorporate a graphical model into regression forests to exploit structural constraints. Experiments demonstrate that our method can efficiently learn local body structures and localise joints. Compared with state-of-the-art methods, our method significantly improves the accuracy of pose estimation from depth images.

Introduction

Accurate estimation of human poses is a key step for many visual applications, such as human computer interaction, smart video surveillance, character animation and augmented reality. A nice review on this topic can be found in [1]. Although considerable research effort has been devoted to it, pose estimation is still a challenging task due to cluttered background, occlusion, and variation in appearance and pose [2]. Most techniques address these challenges from two aspects: one seeks features that are discriminative and robust to noise and to variations in appearance and pose; the other designs graphical models that exploit structural information to constrain the distributions of body joints.

With respect to features for pose estimation, a variety of discriminative features have been developed [3]. Recently, with the development of depth sensing techniques (such as Kinect or time-of-flight sensors), many works have focused on extracting features from depth images [4], [5]. A depth image represents depth measurements of the scene [6], [7], [8]. Compared with RGB images, depth images supply much richer geometric information, facilitating both the separation of the human body from the background and the disambiguation of similar poses. Generally, appearance and shape are the commonly used cues for pose estimation. As to depth-appearance-based features, Plagemann et al. [4] proposed a geodesic-distance-based feature, which is computationally expensive because points of interest are calculated iteratively, and Shotton et al. [5] proposed depth comparison features (DCF), which describe body parts by depth differences at a sequence of random offsets. Their work yielded state-of-the-art results; their features are effective and efficient on depth images, and many later works [9], [10], [11], [12] have benefited from them. As to depth-shape-based features, Li et al. [13] proposed a shape-based feature, termed 3DSC, which utilises depth information to obtain an edge-point mask and calculates silhouette histograms on this 2D mask to detect end-points of interest. Since these features are extracted on 2D mask images, they lack 3D information; furthermore, that framework only processes a limited set of end-points (e.g. head, hands and feet). Baak et al. [14] and Ye et al. [15] used point cloud matching techniques for pose estimation, which are computationally demanding. To the best of our knowledge, no shape-based feature has yet achieved performance comparable to DCF. In our work, we aim to propose a novel depth-shape-based feature that attains satisfactory results.
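The depth comparison features of Shotton et al. [5] mentioned above can be sketched as follows. The core idea, normalising the probe offsets by the depth at the reference pixel so that responses are approximately depth-invariant, follows their formulation; the background constant, data layout and offset values here are illustrative assumptions, not their exact design:

```python
import numpy as np

BACKGROUND = 10.0  # assumed depth (metres) for probes off the image or body

def probe(depth, r, c):
    """Read depth at (r, c), returning a large background value out of bounds."""
    r, c = int(round(r)), int(round(c))
    if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
        return depth[r, c]
    return BACKGROUND

def dcf(depth, x, offsets):
    """One depth-comparison response per offset pair.

    depth   : (H, W) array of depth values in metres
    x       : (row, col) reference pixel
    offsets : (n, 2, 2) array of offset pairs (u, v), scaled in metre-pixels
    """
    d = depth[x[0], x[1]]  # dividing offsets by d gives depth invariance
    feats = np.empty(len(offsets))
    for i, (u, v) in enumerate(offsets):
        feats[i] = probe(depth, x[0] + u[0] / d, x[1] + u[1] / d) \
                 - probe(depth, x[0] + v[0] / d, x[1] + v[1] / d)
    return feats
```

A probe pair straddling the silhouette boundary yields a large response, which is what makes these features discriminative for body-part classification despite their simplicity.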

With respect to models of human pose, the pictorial structure model [16] is one of the most popular, owing to its effective representation of articulated objects and its efficient inference algorithm. It is trained to learn the spatial relationship between pairs of joints, since the location of a joint is well constrained by its connected joints. At the inference stage, the likelihood of each body joint is evaluated over the 2D/3D space restricted by the trained model. Many improvements to this model have been made, and the most relevant work falls into one of three directions: building more reliable body part (or joint) detectors [17], [18], [19], [20], [21], introducing richer body models [22], [23], [24], [25], [26], [27], or performing inference [24], [28] by imposing temporal constraints. In the first direction, many methods tend to be finely tuned to a specific dataset. In the other two directions, the complex models and inference require extensive computation. Most of these methods can hardly provide real-time output due to the complexity of part detection and inference on RGB images. In recent years, some joint detection algorithms using random forests have produced real-time state-of-the-art results [29], [30], [31], [32], [33]. However, they infer the locations of body joints either independently [5], [10] or relying on some global latent variables [9], neglecting the dependence between body joints. Dantone et al. [21] designed two-layer regression forests to learn more reliable joint detectors and modelled the constraints using Gaussian distributions for efficient inference on RGB images. Yu et al. [34] integrated action detection and cross-modality regression forests for the estimation of 3D human pose.

In this paper, we propose a novel framework for human pose recognition. It consists of two main modules. First, we propose a new depth-shape-based feature, termed the 3D Local Shape Context feature (3DLSC), by extending the 2D Shape Context (2DSC) [35] to 3D space, to characterise the location cues between the human silhouette and the joints. Different from 3DSC [13], our 3DLSC captures the relative position information of silhouette points in 3D space; thus our feature is body-size invariant and adapts efficiently to persons of different heights. Experiments demonstrate that our shape-based feature achieves results comparable to the widely used DCF for pose estimation on depth images. Second, we propose a combined learning scheme that incorporates a data-dependent pictorial structure into regression forests. More specifically, depending on the training data arriving at the leaf nodes of the regression forests, our model can learn the distribution of each joint and the spatial constraints between adjacent joints. Different from the general pictorial structure [16], our proposal models relative distributions according to the specific test image. Compared with the state-of-the-art methods, our proposal significantly increases the accuracy of pose estimation.

The rest of the paper is organised as follows. In Section 2, we present the construction of our 3DLSC feature, which consists of two steps: silhouette extraction and histogram binning. The details of our graphical models and regression forests are presented in Section 3. Finally, experiments and discussion are shown in Section 4 and conclusion and future work are given in Section 5.

Section snippets

3D local shape context

In this section we present our 3DLSC feature. The 2DSC feature was first proposed in [35] for shape matching. It has been applied to pose estimation, as it efficiently encodes local information of the human silhouette using histograms in logarithmic polar (log-polar) coordinates [36], [37], [13]. However, it faces two problems: (1) the body silhouette obtained by motion detection is usually noisy, and it is difficult to extract inner edges due to ambiguity in clothing texture [36]; (2)
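The log-polar histogram binning underlying a shape context, extended to 3D as in our 3DLSC, can be sketched as follows. This is a minimal illustration only: the bin counts, radial limits and the azimuth/elevation parameterisation are assumptions, not the exact design in the paper.

```python
import numpy as np

def shape_context_3d(points, center, n_r=5, n_az=6, n_el=6,
                     r_min=0.05, r_max=1.0):
    """Histogram of silhouette points around `center` in log-polar 3D bins.

    points : (n, 3) silhouette points in metres
    center : (3,) reference point (e.g. a candidate joint location)
    Bins are log-radius x azimuth x elevation (illustrative layout).
    """
    d = points - center
    r = np.linalg.norm(d, axis=1)
    keep = (r > 1e-9) & (r < r_max)
    d, r = d[keep], r[keep]
    # radial bin: equal steps in log space between r_min and r_max,
    # so nearby points are described more finely than distant ones
    log_r = np.log(np.clip(r, r_min, None) / r_min)
    r_bin = np.minimum((log_r / np.log(r_max / r_min) * n_r).astype(int),
                       n_r - 1)
    az = np.arctan2(d[:, 1], d[:, 0])            # azimuth in [-pi, pi)
    az_bin = ((az + np.pi) / (2 * np.pi) * n_az).astype(int) % n_az
    el = np.arccos(np.clip(d[:, 2] / r, -1, 1))  # elevation in [0, pi]
    el_bin = np.minimum((el / np.pi * n_el).astype(int), n_el - 1)
    hist = np.zeros((n_r, n_az, n_el))
    np.add.at(hist, (r_bin, az_bin, el_bin), 1.0)
    return hist / max(len(d), 1)  # normalise by point count
```

Normalising radii by a body-size estimate (not shown here) is one way such a descriptor can be made invariant to subjects of different heights, as the paper claims for 3DLSC.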

Graphical models and regression forests

Several methods have been proposed to learn mappings from shape features to human pose [36], [37], [13]. However, they are either easily confused by ambiguous shapes in single-pose estimation [36], [37] or designed for the detection of specific end-points [13], which makes them unfit for general joint detection. Recently, regression forests have proved to be an efficient algorithm for pose recognition [9], [10]: they can handle high-dimensional feature vectors and have low computational complexity.
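The combination of per-joint detectors with pairwise structural constraints can be illustrated with a small max-product (Viterbi) inference over a kinematic chain. This is a generic sketch of tree-structured pictorial-structure inference, not the paper's exact model: the unary scores stand in for regression-forest votes, and the isotropic Gaussian pairwise term and its parameters are assumptions.

```python
import numpy as np

def chain_map(unary, candidates, mean_offsets, sigma):
    """MAP joint locations over a kinematic chain via max-product.

    unary        : list, per joint, of (n_i,) log-scores (e.g. forest votes)
    candidates   : list, per joint, of (n_i, 3) candidate 3D locations
    mean_offsets : (J-1, 3) expected offsets between consecutive joints
    sigma        : stddev of the Gaussian pairwise constraint (assumed isotropic)
    """
    J = len(unary)
    score = [np.asarray(unary[0], dtype=float)]
    back = []
    for j in range(1, J):
        # pairwise log-potential: Gaussian penalty on deviation from mean offset
        diff = candidates[j][:, None, :] - candidates[j - 1][None, :, :]
        pair = -np.sum((diff - mean_offsets[j - 1]) ** 2, axis=2) \
               / (2 * sigma ** 2)
        total = pair + score[-1][None, :]        # shape (n_j, n_{j-1})
        back.append(np.argmax(total, axis=1))    # best predecessor per candidate
        score.append(unary[j] + np.max(total, axis=1))
    # backtrack from the best candidate of the last joint
    idx = [int(np.argmax(score[-1]))]
    for j in range(J - 1, 0, -1):
        idx.append(int(back[j - 1][idx[-1]]))
    idx.reverse()
    return [candidates[j][idx[j]] for j in range(J)]
```

The pairwise term pulls each joint towards the expected offset from its neighbour, so a spurious but high-scoring candidate far from the rest of the body is rejected, which is the benefit the paper attributes to combining forests with a graphical model.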

Datasets

In this section, we evaluate our algorithm for human pose estimation on two depth datasets, the Stanford dataset [41] and our THU pose dataset. There is a similar dataset [5], but it is not publicly available yet.

The Stanford dataset consists of 28 action sequences, which include 7891 images in total with a resolution of 176×144. All the images were captured from frontal view using a ToF camera in a lab environment. Among the images, 6000 are selected for training and the rest, less than 2000,

Conclusion and future work

We have proposed a novel approach to human pose estimation from depth images, which significantly outperforms the state-of-the-art methods. Our model combines regression forests and graphical models. It considers the dependence between body joints by using a predefined graphical model. The results have shown that, by employing such a combination, the accuracy for human pose estimation could be dramatically improved. Furthermore, we have proposed a new 3D local shape feature called 3DLSC, which

Acknowledgements

This work was partially sponsored by NSFC 61271390, by 863 Project 2015AA016304 and by the Special Foundation for the Development of Strategic Emerging Industries of Shenzhen (No. ZDSYS201405091729599 & No. YJ20130402145002441).

Li He received the B.S. degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2010. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include the applications of machine learning and pattern recognition in human pose/action recognition and tracking.

References (42)

  • Z. Hu et al.

    Recovery of upper body poses in static images based on joints detection

    Pattern Recognit. Lett.

    (2009)
  • K. Buys et al.

    An adaptable system for RGB-D based human body detection and pose estimation

    J. Vis. Commun. Image Represent.

    (2014)
  • F. Li et al.

    Attribute-based knowledge transfer learning for human pose estimation

    Neurocomputing

    (2013)
  • T.B. Moeslund

    Visual Analysis of Humans: Looking at People

    (2011)
  • M. Andriluka et al.

    Discriminative appearance models for pictorial structures

    Int. J. Comput. Vis.

    (2012)
  • C. Plagemann, V. Ganapathi, D. Koller, S. Thrun, Real-time identification and localization of body parts from depth...
  • J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose...
  • G. Wang et al.

    Depth estimation for speckle projection system using progressive reliable points growing matching

    Appl. Opt.

    (2013)
  • A. Shpunt, Z. Zalevsky, Three-Dimensional Sensing using Speckle Patterns, US Patent App. 12/282,517, March 8...
  • C. Shi, G. Wang, X. Yin, X. Pei, B. He, X. Lin, High-accuracy stereo matching based on adaptive ground control points,...
  • M. Sun, P. Kohli, J. Shotton, Conditional regression forests for human pose estimation, in: IEEE Conference on Computer...
  • R. Girshick, J. Shotton, P. Kohli, A. Criminisi, A. Fitzgibbon, Efficient regression of general-activity human poses...
  • G. Fanelli et al.

    Random forests for real time 3D face analysis

    Int. J. Comput. Vis.

    (2013)
  • Z. Li, D. Kulic, Local shape context based real-time endpoint body part detection and identification from depth images,...
  • A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, C. Theobalt, A data-driven approach for real-time full body pose...
  • M. Ye, X. Wang, R. Yang, L. Ren, M. Pollefeys, Accurate 3D pose estimation from a single depth image, in: IEEE...
  • P.F. Felzenszwalb et al.

    Pictorial structures for object recognition

    Int. J. Comput. Vis.

    (2005)
  • M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation,...
  • B. Sapp, A. Toshev, B. Taskar, Cascaded models for articulated pose estimation, in: Computer Vision–ECCV 2010,...
  • Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, in: IEEE Conference on Computer...
  • M. Sun, S. Savarese, Articulated part-based model for joint object detection and pose estimation, in: IEEE...

    Guijin Wang was born in 1976. He received the B.S. and Ph.D. degrees (with honor) from the Department of Electronic Engineering, Tsinghua University, China in 1998 and 2003 respectively, all in Signal and Information Processing. From 2003 to 2006, he has been with Sony Information Technologies Laboratories as a researcher. From Oct., 2006, he has been with the Department of Electronic Engineering, Tsinghua University, China as an Associate Professor. He has published over 80 international journal and conference papers, and held several patents. His research interests are focused on wireless multimedia, image and video processing, depth imaging, pose recognition, intelligent surveillance, industry inspection, object detection and tracking, online learning, etc.

    Qingmin Liao received the B.S. degree in radio technology from the University of Electronic Science and Technology of China, Chengdu, China, in 1984, and the M.S. and Ph.D. degrees in Signal Processing and Telecommunications from the University of Rennes 1, Rennes, France, in 1990 and 1994, respectively. Since 1995, he has been with Tsinghua University, Beijing, China. He became a Professor in the Department of Electronic Engineering of Tsinghua University, in 2002. From 2001 to 2003, he served as the Invited Professor with a tri-year contract at the University of Caen, France. Since 2010, he has been the Director of the Division of Information Science and Technology in the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China. His research interests include image/video processing, transmission and analysis; biometrics; and their applications to teledetection, medicine, industry, and sports. He has published over 90 papers internationally.

    Jing-Hao Xue received the B.Eng. degree in Telecommunication and Information Systems in 1993 and the Dr.Eng. degree in Signal and Information Processing in 1998, both from Tsinghua University, the M.Sc. degree in Medical Imaging and the M.Sc. degree in Statistics, both from Katholieke Universiteit Leuven in 2004, and the degree of Ph.D. in Statistics from the University of Glasgow in 2008. Since 2008, he has worked in the Department of Statistical Science at University College London, as a Lecturer and Senior Lecturer. His current research interests include statistical classification, high-dimensional data analysis, computer vision, and pattern recognition.
