Elsevier

Signal Processing

Volume 112, July 2015, Pages 162-179

3D real human reconstruction via multiple low-cost depth cameras

https://doi.org/10.1016/j.sigpro.2014.10.021

Highlights

  • A real 3D human body is scanned using low-cost depth cameras.

  • The whole 3D point cloud of the human is globally registered.

  • The reconstructed mesh quality is satisfactory.

Abstract

In traditional human-centered games and virtual reality applications, a skeleton is commonly tracked using consumer-level cameras or professional motion capture devices to animate an avatar. In this paper, we propose a novel application that automatically reconstructs a real 3D moving human captured by multiple RGB-D cameras in the form of a polygonal mesh, which may help users to actually enter a virtual world or even a collaborative immersive environment. Compared with 3D point clouds, a 3D polygonal mesh is more commonly adopted to represent objects or characters in games and virtual reality applications. A vivid 3D human mesh can greatly enhance the feeling of immersion when interacting with a computer. The proposed method includes three key steps for realizing dynamic 3D human reconstruction from RGB images and noisy depth data captured from a distance. First, we remove the static background to obtain a 3D partial view of the human from the depth data with the help of calibration parameters, and register neighboring partial views pairwise. The whole 3D human is then globally registered using all the partial views to obtain a relatively clean 3D human point cloud. Next, a complete 3D mesh model is constructed from the point cloud using Delaunay triangulation and Poisson surface reconstruction. Finally, a series of experiments demonstrates the reconstruction quality of the 3D human meshes. Dynamic meshes with different poses are placed in a virtual environment, which can be used to provide personalized avatars for everyday users and enhance the interactive experience in games and virtual reality environments.

Introduction

Real 3D scene and human reconstruction from professional 3D scanners has been studied extensively in multimedia, virtual reality, and computer graphics [1]. Satisfactory results have been obtained that are applicable to industrial reverse engineering and production design based on virtual reality. However, these techniques cannot be applied directly in home-centered environments owing to their high cost, large volume, complex operation, and computational burden. In recent years, portable, low-cost, easy-to-use RGB-D cameras such as the Kinect [2] have become very popular and widely used. However, this type of camera captures low-quality depth images, which is a major constraint in providing high-quality immersive and virtual applications. Thus, many researchers have shown great interest in recent developments in scene reconstruction and human modeling using RGB-D sensors.

There are several pioneering works focusing on interesting 3D virtual applications of RGB-D sensors. Alexiadis et al. [3] built a real-time automatic system for dance performance evaluation using a Kinect RGB-D sensor, and provided visual feedback for beginners in a 3D virtual scene. Liu et al. [4] used depth sensor data from a Kinect to track human movement and estimate players' energy expenditure while playing in a virtual environment. Bleiweiss et al. [5] proposed a solution to animate in-game avatars using real-time motion capture data, blending the players' actual movements with predefined animation sequences. Pedersoli et al. [6] provided a framework for the Kinect that enables more natural and intuitive hand-gesture communication between a human and a computer. Tong et al. [7] presented a novel scanning system for capturing different parts of the human body at close range, and then reconstructing the whole 3D human body. However, this system requires external calibration and an accurate capture space, which are not available in home applications. The system proposed by Anguelov et al. [8] deforms a data-driven shape completion and animation of people (SCAPE) model to fit the scanned data given a limited set of markers specifying the target shape. Using a single static scan and markers for motion capture, their system constructs a high-quality animated surface model of a moving person with realistic muscle deformation. Similarly, Weiss et al. [9] introduced a SCAPE model with 15 body parts that fits poses of a real human to a 3D model, while Chen et al. [10] adopted training data, including 3D meshes of multiple users with different poses as obtained from SCAPE, to model existing 3D human body meshes. Alexiadis et al. [11] proposed a full 3D reconstruction of moving foreground objects from depth cameras.
Several authors have also focused on other related applications, e.g., tracking a human body [12], [13], recognizing human poses in real time [14], and converting a movie clip into a comic [15] using depth cameras.

However, unlike commercial 3D scanners that can output clean, dense point clouds for use in building a structured mesh easily, the point cloud captured from depth cameras is sparse, discontinuous, and noisy, resulting in the generation of many holes and missing depth values. This is caused by the nature of infrared light measurement, which is affected by multiple factors, including scattering, absorption by dark objects, transmission through transparent objects, multiple reflections, and large distances. We propose a pipeline aimed at obtaining a real textured 3D human body from low-cost depth cameras. We first remove the background to obtain a 3D partial view of the human from each depth image with the help of intrinsic and stereo calibration parameters, and successively register two neighboring partial views as a concatenated partial view. The complete 3D point cloud of the human is globally registered using all the pairwise registrations. A 3D mesh structure incorporating color is reconstructed using Delaunay triangulation and the Poisson method. To verify the effectiveness, efficiency, and robustness of the real 3D human reconstruction using multiple low-cost depth cameras, we built an experimental environment. Six depth cameras (Kinect) were placed at different positions on a circle to capture multiple views with as many overlapping points as possible. Fig. 1 shows the entire process from setting up the hardware system to the final 3D human reconstruction. The results show that the quality of the reconstructed mesh is satisfactory.

How to build a correspondence of partial 3D views automatically and effectively is the key to 3D human reconstruction. Shape correspondence, matching, and retrieval of 3D shapes have been extensively investigated in recent years [16], [17], [18], and these are still hot topics in multimedia, computer vision, computer graphics, and computer-aided design. Many researchers have attempted to provide content-based matching and retrieval techniques, focusing mainly on local point description and high-level feature extraction [19], [20], [21], [22], topological structure [23], non-rigid shape features [24], sketch-based retrieval [25], view comparison [26], [27], [28], [29], [30], [31], and relevance feedback mechanisms [32]. For example, to match a single-view Kinect scan to high-quality 3D models, Shen et al. [33] proposed recovering the underlying structure of a scanned object by assembling suitable parts obtained from the repository models. Kim et al. [34] presented an efficient method for acquiring 3D indoor environments with variability and repetition through a Kinect sensor. Their work segments a single 3D point cloud scanned using a real-time simultaneous localization and mapping (SLAM) technique, classifies it into plausible objects, recognizes these using primitive fitting and connected component analysis, and extracts their pose parameters. Kakadiaris et al. [35] used a composed deformable model to fit 2D body silhouettes extracted from three mutually orthogonal views. Wang [36] constructed a feature wireframe of a human body from laser-scanned 3D unorganized points using semantic feature extraction, and modeled the symmetric detail mesh surface of the human body. Hsieh et al. [37] presented a novel segmentation algorithm to segment a body posture into different body parts using the deformable triangulation technique. Holte et al. [38] summarized recent approaches for 3D human pose estimation based on 3D features reconstructed from multiple views.
However, our problem is different: we require robust correspondence between noisy partial views without markers, which are scanned by low-cost depth cameras. Moreover, global registration with a small accumulated error distributed among multiple cameras must be automatically implemented in almost real time. We design a registration framework to realize the two objectives simultaneously.

Contributions: The past decade has seen the emergence of new 3D imaging devices and techniques capable of capturing full human body shapes [39], which are used extensively in human engineering, health assessment, protective equipment (car or airplane seats), medical diagnosis, entertainment, the clothing industry, and virtual reality. In this study, we implemented an entire reconstruction pipeline for a real 3D human as a prototype system composed of multiple low-cost depth cameras. Furthermore, the effectiveness, efficiency, and robustness of the system are investigated by applying it to a group of users with different poses. The entire process is completely automatic and does not require human body landmarks or any manual operation on the 3D model, which is important for dynamic virtual reality environments. Our approach is realizable with the following two technical contributions:

  • 1.

    A solution to 3D partial view generation of humans is proposed. Given color images and noisy depth images captured by a single sensor, our method adopts background removal to extract a coarse depth view of the human, followed by depth smoothing. Partial view filters are employed to obtain a relatively clean 3D partial point cloud.

  • 2.

    To provide more robust and accurate registration for sparse and noisy point clouds, we introduce an initial pairwise registration method using segment-constrained correspondence. The correspondence is built over key points and uniformly sampled points, and is established by comparing their feature descriptors. We add a segment constraint to spectral matching, tailored to human point clouds, which significantly improves the correspondence accuracy. The proposed scheme provides an initial alignment close to the optimal solution of the fine registration.
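
As a deliberately simplified illustration of this idea, the sketch below implements plain spectral matching with the segment constraint applied as a hard filter on candidate pairs. The point sets, segment labels, and the Gaussian affinity kernel are hypothetical stand-ins, not the paper's actual descriptor-driven candidates:

```python
import numpy as np

def spectral_match(cands, pts_a, pts_b, seg_a, seg_b, sigma=0.5):
    """Toy spectral matching over candidate correspondences (i, j).
    The segment constraint is a hard filter: a candidate survives
    only if both points carry the same body-segment label."""
    cands = [(i, j) for i, j in cands if seg_a[i] == seg_b[j]]
    n = len(cands)
    M = np.zeros((n, n))
    for a, (i, j) in enumerate(cands):
        for b, (k, l) in enumerate(cands):
            # pairwise consistency: a correct pair of matches preserves
            # the distance between the two points
            d = abs(np.linalg.norm(pts_a[i] - pts_a[k])
                    - np.linalg.norm(pts_b[j] - pts_b[l]))
            M[a, b] = np.exp(-(d / sigma) ** 2)
    score = np.abs(np.linalg.eigh(M)[1][:, -1])   # principal eigenvector
    used_i, used_j, picked = set(), set(), []
    for a in np.argsort(-score):                  # greedy one-to-one pick
        i, j = cands[a]
        if i not in used_i and j not in used_j:
            picked.append((i, j)); used_i.add(i); used_j.add(j)
    return picked

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
cands = [(i, j) for i in range(3) for j in range(3)]
matches = spectral_match(cands, pts, pts, seg_a=[0, 1, 2], seg_b=[0, 1, 2])
```

With distinct segment labels, the filter alone prunes the candidate set to the consistent matches; in practice the segment constraint mainly shrinks the affinity matrix and removes cross-limb ambiguities.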

Section snippets

3D partial view generation

In this section, we propose a method for extracting 3D partial views of humans from captured RGB-D images to obtain a relatively clean 3D human automatically from cluttered environments. The scheme relies on dealing with noisy low-resolution depth maps by resorting to relatively high-quality color images. The static background in a color image is easily removed and mapped to the depth image to extract a depth view of the human. Internal calibration parameters are used to convert the depth view
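
A minimal sketch of the back-projection step described above (not the paper's implementation; the intrinsic parameters and the fixed depth threshold used here as crude background removal are hypothetical):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, max_depth=3.0):
    """Back-project a depth map (in metres) to a 3D point cloud with
    pinhole intrinsics (fx, fy, cx, cy), dropping zero (invalid) pixels
    and everything farther than max_depth as static background."""
    v, u = np.indices(depth.shape)             # pixel row/column grids
    valid = (depth > 0) & (depth < max_depth)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx               # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z])          # (N, 3) partial view

# toy 4x4 depth map: a 2x2 foreground patch one metre from the camera
depth = np.zeros((4, 4))
depth[1:3, 1:3] = 1.0
pts = depth_to_points(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
```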

Feature correspondence for initial registration

Recent advances in feature correspondence have mainly focused on the following topics: feature extraction based on local convolutional auto-encoders [42] and topic models [43], and feature comparison via Hausdorff distance learning [44]. Unlike these studies, our work relies on key point selection on a 3D point cloud, and computes feature descriptors for coarse correspondence between partial 3D views.

Key point selection: Owing to the large number of points in a point cloud,
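
The coarse matching stage can be illustrated with mutual nearest-neighbour search in descriptor space; the toy 2-D descriptors below are hypothetical stand-ins for the real feature descriptors computed at key points:

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Mutual nearest-neighbour matching of two descriptor sets.
    Returns index pairs (i, j) where a[i] and b[j] pick each other."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)           # best b for each a
    nn_ba = d.argmin(axis=0)           # best a for each b
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

desc_a = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc_b = np.array([[1.1, 0.9], [4.9, 5.2], [0.1, -0.1]])
matches = match_descriptors(desc_a, desc_b)   # [(0, 2), (1, 0), (2, 1)]
```

The mutual check discards one-sided matches, which is a common way to suppress spurious correspondences before any geometric verification.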

Fine pairwise registration

To obtain accurate registration of two partial 3D views and significantly reduce the influence of noisy data and outliers, we follow the notion of coherent point drift [57]. A set of points X in one partial 3D view can be modeled using Gaussian mixture models, while another set of points Y is regarded as the data generated by these Gaussian mixture models. The posterior probability of data points in Y corresponding to the Gaussian mixture models centered on points in X should be maximized after
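
A sketch of the E-step under this model, assuming the standard coherent point drift formulation with a uniform outlier component (the point values, variance, and outlier weight below are illustrative):

```python
import numpy as np

def cpd_posteriors(X, Y, sigma2, w=0.1):
    """E-step of CPD-style registration: P[m, n] is the posterior that
    data point Y[n] was generated by the Gaussian centred at X[m];
    w is the weight of a uniform outlier component."""
    M, N = len(X), len(Y)
    D = X.shape[1]
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    g = np.exp(-d2 / (2.0 * sigma2))
    # constant from the uniform outlier term in the CPD mixture
    c = (2.0 * np.pi * sigma2) ** (D / 2.0) * w * M / ((1.0 - w) * N)
    return g / (g.sum(axis=0) + c)

X = np.array([[0.0, 0.0], [5.0, 5.0]])   # GMM centroids (one partial view)
Y = np.array([[0.1, 0.0], [5.0, 4.9]])   # data points (the other view)
P = cpd_posteriors(X, Y, sigma2=1.0)
```

In the full algorithm these posteriors drive the M-step that re-estimates the rigid transform and variance; the outlier term keeps noisy points from dominating the alignment.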

Global registration

While sequentially aligning each 3D partial view pairwise with its neighbor, the error gradually accumulates so that the starting and ending partial views cannot be aligned. We introduce global registration to obtain a consensus registration for all partial views by diffusing pairwise registration errors. From all the pairwise registrations, we iteratively select the view with the most matching pairs. The current view is aligned with all its neighboring views to obtain the new registration
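
The error-diffusion idea can be illustrated with a translation-only toy analogue (rotations omitted for brevity; the pairwise measurements are made up and deliberately inconsistent around the loop):

```python
import numpy as np

def diffuse_offsets(n_views, pairwise, n_iter=100):
    """Toy translation-only analogue of global registration: each view's
    offset is repeatedly re-estimated from all neighbouring views and
    the measured pairwise shifts, spreading the loop-closure error over
    the whole camera ring. View 0 anchors the coordinate frame."""
    t = np.zeros((n_views, 2))
    for _ in range(n_iter):
        for i in range(1, n_views):              # view 0 fixes the gauge
            preds = []
            for (a, b), d in pairwise.items():   # measurement: t[b] ~ t[a] + d
                if b == i:
                    preds.append(t[a] + d)
                elif a == i:
                    preds.append(t[b] - d)
            t[i] = np.mean(preds, axis=0)        # consensus over neighbours
    return t

# three views in a loop; the direct 0->2 measurement disagrees by 0.3
pairwise = {(0, 1): np.array([1.0, 0.0]),
            (1, 2): np.array([1.0, 0.0]),
            (0, 2): np.array([2.3, 0.0])}
t = diffuse_offsets(3, pairwise)
```

Instead of the 0.3 discrepancy landing entirely on the last view, the iteration converges to offsets (1.1 and 2.2) that split the residual among the edges, which is the qualitative behaviour global registration aims for.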

Human mesh reconstruction

The colored human point cloud generated above has three defects, which may cause an incorrect mesh reconstruction. One problem is that the point cloud is not smooth and contains several burrs on the boundary, resulting from interference between the RGB-D cameras, inevitable pairwise and global registration errors, and the background mixture. We solve the organization problem of these unordered 3D points, and adopt a polygonal form to connect these points topologically. Another
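
One ingredient of the Poisson step is a per-point normal, which can be sketched with PCA over local neighbourhoods (a generic technique, not necessarily the paper's exact procedure; the brute-force neighbour search is for clarity only):

```python
import numpy as np

def estimate_normals(pts, k=8):
    """PCA normal estimation: Poisson surface reconstruction needs
    oriented per-point normals; here each normal is the direction of
    least variance among a point's k nearest neighbours."""
    normals = np.zeros_like(pts)
    for i, p in enumerate(pts):
        # brute-force k nearest neighbours of p
        nbrs = pts[np.argsort(np.linalg.norm(pts - p, axis=1))[:k]]
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
        normals[i] = np.linalg.eigh(cov)[1][:, 0]  # smallest-eigenvalue axis
    return normals

# points on the z = 0 plane: every estimated normal should be ±(0, 0, 1)
rng = np.random.default_rng(0)
pts = np.column_stack([rng.random((20, 2)), np.zeros(20)])
n = estimate_normals(pts)
```

A real pipeline would additionally orient the normals consistently (e.g., toward the capturing camera) before feeding them to Poisson reconstruction.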

Experimental results

Experimental environment: To verify the whole procedure, we built a hardware system, which includes six low-cost depth cameras (Kinect), their brackets, a dark blanket, six USB cables connecting the depth cameras to the computer, and a desktop computer with an i7 CPU and 16 GB memory. Fig. 1(a) illustrates the experimental environment, with six depth cameras placed at different locations on a circle to capture different views with as many overlapping points as possible. We do not require that

Conclusion

In this paper, we presented a novel method for reconstructing real 3D humans by employing multiple low-cost depth cameras. The core of the implementation lies in how to reconstruct realistic humans from noisy RGB-D data. The process consists of three key steps: 3D partial view generation, registration of partial views, and human mesh reconstruction. The novel application allows a user to be truly immersed in a virtual reality scene using low-cost RGB-D cameras. This should greatly

Acknowledgments

The work is partially supported by NSFC (Nos. 61003137, 61202185, 61473231, and 91120005), NWPU Basic Research Fund 310201401 (JCQ01009 and JCQ01012), Fund of National Engineering Center for Commercial Aircraft Manufacturing (201410), National Aerospace Science Foundation of China (2014ZD53), and Open Fund of State Key Lab of CAD & CG in Zhejiang University (A15).

References (63)

  • F. Pedersoli, N. Adami, S. Benini, R. Leonardi, XKin: extendable hand pose and gesture recognition library for Kinect,...
  • J. Tong et al.

    Scanning 3D full human bodies using Kinects

    IEEE Trans. Vis. Comput. Graph.

    (2012)
  • D. Anguelov et al.

    SCAPE: shape completion and animation of people

    ACM Trans. Graph.

    (2005)
  • A. Weiss, D. Hirshberg, M.J. Black, Home 3D body scans from noisy image and range data, in: Proceedings of the...
  • Y. Chen, Z. Liu, Z. Zhang, Tensor-based human body modeling, in: Proceedings of the IEEE Conference on Computer Vision...
  • D.S. Alexiadis et al.

    Full 3-D reconstruction of moving foreground objects from multiple consumer depth cameras

    IEEE Trans. Multimed.

    (2013)
  • T. Helten, M. Müller, H.-P. Seidel, C. Theobalt, Real-time body tracking with one depth camera and inertial sensors,...
  • O. Gedik et al.

    3-D rigid body tracking using vision and depth sensors

    IEEE Trans. Cybern.

    (2013)
  • J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose...
  • M. Wang et al.

    Movie2Comics: towards a lively video content presentation

    IEEE Trans. Multimed.

    (2012)
  • B. Leng et al.

    ModelSeek: an effective 3D model retrieval system

    Multimed. Tools Appl.

    (2011)
  • Z. Liu et al.

    A survey on partial retrieval of 3D shapes

    J. Comput. Sci. Technol.

    (2013)
  • Y. Gao et al.

    View-based 3D object retrieval: challenges and approaches

    IEEE MultiMed.

    (2014)
  • A.M. Bronstein et al.

    Shape Google: geometric words and expressions for invariant shape retrieval

    ACM Trans. Graph.

    (2011)
  • B. Leng, X. Zhang, M. Yao, Z. Xiong, 3D object classification using deep belief networks, in: MultiMedia Modeling,...
  • B. Leng, X. Zhang, M. Yao, X. Zhang, A 3D model recognition mechanism based on deep Boltzmann machines, Neurocomputing,...
  • R. Ji, H. Yao, X. Sun, B. Zhong, W. Gao, Towards semantic embedding in visual vocabulary, in: IEEE Conference on...
  • M. Eitz et al.

    Sketch-based shape retrieval

    ACM Trans. Graph.

    (2012)
  • Y. Gao et al.

    Less is more: efficient 3D object retrieval with query view selection

    IEEE Trans. Multimed.

    (2011)
  • Y.-S. Liu et al.

    Computing the inner distances of volumetric models for articulated shape description with a visibility graph

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • Y. Gao et al.

    Camera constraint-free view-based 3D object retrieval

    IEEE Trans. Image Process.

    (2012)