Elsevier

Signal Processing

Volume 112, July 2015, Pages 162-179

3D real human reconstruction via multiple low-cost depth cameras

https://doi.org/10.1016/j.sigpro.2014.10.021

Highlights

  • A real 3D human body is scanned using low-cost depth cameras.

  • The whole 3D point cloud of the human is globally registered.

  • The reconstructed mesh quality is satisfactory.

Abstract

In traditional human-centered games and virtual reality applications, a skeleton is commonly tracked using consumer-level cameras or professional motion capture devices to animate an avatar. In this paper, we propose a novel application that automatically reconstructs a real 3D moving human captured by multiple RGB-D cameras in the form of a polygonal mesh, which may help users to actually enter a virtual world or even a collaborative immersive environment. Compared with 3D point clouds, a 3D polygonal mesh is more commonly adopted to represent objects or characters in games and virtual reality applications. A vivid 3D human mesh can greatly enhance the feeling of immersion when interacting with a computer. The proposed method includes three key steps for realizing dynamic 3D human reconstruction from RGB images and noisy depth data captured from a distance. First, we remove the static background to obtain a 3D partial view of the human from the depth data with the help of calibration parameters, and register neighboring partial views pairwise. The whole 3D human is then globally registered using all the partial views to obtain a relatively clean 3D human point cloud. Next, a complete 3D mesh model is constructed from the point cloud using Delaunay triangulation and Poisson surface reconstruction. Finally, a series of experiments demonstrates the reconstruction quality of the 3D human meshes. Dynamic meshes with different poses are placed in a virtual environment, which can be used to provide personalized avatars for everyday users and enhance the interactive experience in games and virtual reality environments.

Introduction

Real 3D scene and human reconstruction from professional 3D scanners has been studied extensively in multimedia, virtual reality, and computer graphics [1]. Satisfactory results have been obtained that are applicable to industrial reverse engineering and production design based on virtual reality. However, these techniques cannot be applied directly in home-centered environments owing to their high cost, large volume, complex operation, and computational burden. In recent years, portable, low-cost, easy-to-use RGB-D cameras such as the Kinect [2] have become very popular and widely used. However, this type of camera captures low-quality depth images, which is a major constraint in providing high-quality immersive and virtual applications. Thus, many researchers have shown great interest in recent developments in scene reconstruction and human modeling using RGB-D sensors.

There are several pioneering works focusing on interesting 3D virtual applications of RGB-D sensors. Alexiadis et al. [3] built a real-time automatic system for dance performance evaluation using a Kinect RGB-D sensor, and provided visual feedback for beginners in a 3D virtual scene. Liu et al. [4] used depth sensor data from a Kinect to track human movement and estimate players' energy expenditure while playing in a virtual environment. Bleiweiss et al. [5] proposed a solution to animate in-game avatars using real-time motion capture data, blending the players' actual movements with predefined animation sequences. Pedersoli et al. [6] provided a framework for the Kinect that enables more natural and intuitive hand-gesture communication between a human and a computer. Tong et al. [7] presented a novel scanning system for capturing different parts of the human body at close range, and then reconstructing the whole 3D human body. However, this system requires external calibration and an accurate capture space, which are not available in home applications. The system proposed by Anguelov et al. [8] deforms a data-driven shape completion and animation of people (SCAPE) model to fit the scanned data given a limited set of markers specifying the target shape. Using a single static scan and markers for motion capture, their system constructs a high-quality animated surface model of a moving person with realistic muscle deformation. Similarly, Weiss et al. [9] introduced a SCAPE model with 15 body parts that fits poses of a real human to a 3D model, while Chen et al. [10] adopted training data, including 3D meshes of multiple users with different poses as obtained from SCAPE, to model existing 3D human body meshes. Alexiadis et al. [11] proposed a full 3D reconstruction of moving foreground objects from depth cameras.
Several authors have also focused on other related applications, e.g., tracking a human body [12], [13], recognizing human poses in real time [14], and converting a movie clip into a comic [15] using depth cameras.

However, unlike commercial 3D scanners that can output clean, dense point clouds for use in building a structured mesh easily, the point cloud captured from depth cameras is sparse, discontinuous, and noisy, resulting in the generation of many holes and missing depth values. This is caused by the nature of infrared light measurement, which is affected by multiple factors, including scattering, absorption by dark objects, transmission through transparent objects, multiple reflections, and large distances. We propose a pipeline aimed at obtaining a real textured 3D human body from low-cost depth cameras. We first remove the background to obtain a 3D partial view of the human from each depth image with the help of intrinsic and stereo calibration parameters, and successively register two neighboring partial views as a concatenated partial view. The complete 3D point cloud of the human is globally registered using all the pairwise registrations. A 3D mesh structure incorporating color is reconstructed using Delaunay triangulation and the Poisson method. To verify the effectiveness, efficiency, and robustness of the real 3D human reconstruction using multiple low-cost depth cameras, we built an experimental environment. Six depth cameras (Kinect) were placed at different positions on a circle to capture multiple views with as many overlapping points as possible. Fig. 1 shows the entire process from setting up the hardware system to the final 3D human reconstruction. The results show that the quality of the reconstructed mesh is satisfactory.

How to build a correspondence of partial 3D views automatically and effectively is the key to 3D human reconstruction. Shape correspondence, matching, and retrieval of 3D shapes have been extensively investigated in recent years [16], [17], [18], and these are still hot topics in multimedia, computer vision, computer graphics, and computer-aided design. Many researchers have attempted to provide content-based matching and retrieval techniques, focusing mainly on local point description and high-level feature extraction [19], [20], [21], [22], topological structure [23], non-rigid shape features [24], sketch-based retrieval [25], view comparison [26], [27], [28], [29], [30], [31], and relevance feedback mechanisms [32]. For example, to match a single-view Kinect scan to high-quality 3D models, Shen et al. [33] proposed recovering the underlying structure of a scanned object by assembling suitable parts obtained from the repository models. Kim et al. [34] presented an efficient method for acquiring 3D indoor environments with variability and repetition through a Kinect sensor. Their work segments a single 3D point cloud scanned using a real-time simultaneous localization and mapping (SLAM) technique, classifies it into plausible objects, recognizes these using primitive fitting and connected component analysis, and extracts their pose parameters. Kakadiaris et al. [35] used a composed deformable model to fit 2D body silhouettes extracted from three mutually orthogonal views. Wang [36] constructed a feature wireframe of a human body from laser-scanned 3D unorganized points using semantic feature extraction, and modeled the symmetric detail mesh surface of the human body. Hsieh et al. [37] presented a novel segmentation algorithm to segment a body posture into different body parts using the deformable triangulation technique. Holte et al. [38] summarized recent approaches for 3D human pose estimation based on 3D features reconstructed from multiple views.
However, our problem is different: we require robust correspondence between noisy partial views without markers, which are scanned by low-cost depth cameras. Moreover, global registration with a small accumulated error distributed among multiple cameras must be automatically implemented in almost real time. We design a registration framework to realize the two objectives simultaneously.

Contributions: The past decade has seen the emergence of new 3D imaging devices and techniques capable of capturing full human body shapes [39], which are used extensively in human engineering, health assessment, protective equipment (car or airplane seats), medical diagnosis, entertainment, the clothing industry, and virtual reality. In this study, we implemented an entire reconstruction pipeline for a real 3D human as a prototype system composed of multiple low-cost depth cameras. Furthermore, the effectiveness, efficiency, and robustness of the system are investigated by applying it to a group of users with different poses. The entire process is completely automatic and does not require human body landmarks or any manual operation on the 3D model, which is important for dynamic virtual reality environments. Our approach is realizable with the following two technical contributions:

  • 1.

    A solution to 3D partial view generation of humans is proposed. Given color images and noisy depth images captured by a single sensor, our method adopts background removal to extract a coarse depth view of the human, followed by depth smoothing. Partial view filters are employed to obtain a relatively clean 3D partial point cloud.

  • 2.

    To provide more robust and accurate registration for sparse and noisy point clouds, we introduce an initial pairwise registration method using segment-constrained correspondence. The correspondence is built over key points and uniformly sampled points, and is established by comparing their feature descriptors. We add a segment constraint to spectral matching, tailored to human point clouds, which significantly improves the correspondence accuracy. The proposed scheme provides an initial alignment close to the optimal solution of the fine registration.
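
As a deliberately simplified illustration of this idea, the sketch below implements plain spectral matching with the segment constraint applied as a hard filter on candidate pairs. The point sets, segment labels, and the Gaussian affinity kernel are hypothetical stand-ins, not the paper's actual descriptor-driven candidates:

```python
import numpy as np

def spectral_match(cands, pts_a, pts_b, seg_a, seg_b, sigma=0.5):
    """Toy spectral matching over candidate correspondences (i, j).
    The segment constraint is a hard filter: a candidate survives
    only if both points carry the same body-segment label."""
    cands = [(i, j) for i, j in cands if seg_a[i] == seg_b[j]]
    n = len(cands)
    M = np.zeros((n, n))
    for a, (i, j) in enumerate(cands):
        for b, (k, l) in enumerate(cands):
            # pairwise consistency: a correct pair of matches preserves
            # the distance between the two points
            d = abs(np.linalg.norm(pts_a[i] - pts_a[k])
                    - np.linalg.norm(pts_b[j] - pts_b[l]))
            M[a, b] = np.exp(-(d / sigma) ** 2)
    score = np.abs(np.linalg.eigh(M)[1][:, -1])   # principal eigenvector
    used_i, used_j, picked = set(), set(), []
    for a in np.argsort(-score):                  # greedy one-to-one pick
        i, j = cands[a]
        if i not in used_i and j not in used_j:
            picked.append((i, j)); used_i.add(i); used_j.add(j)
    return picked

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
cands = [(i, j) for i in range(3) for j in range(3)]
matches = spectral_match(cands, pts, pts, seg_a=[0, 1, 2], seg_b=[0, 1, 2])
```

With distinct segment labels, the filter alone prunes the candidate set to the consistent matches; in practice the segment constraint mainly shrinks the affinity matrix and removes cross-limb ambiguities.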

Section snippets

3D partial view generation

In this section, we propose a method for extracting 3D partial views of humans from captured RGB-D images to obtain a relatively clean 3D human automatically from cluttered environments. The scheme relies on dealing with noisy low-resolution depth maps by resorting to relatively high-quality color images. The static background in a color image is easily removed and mapped to the depth image to extract a depth view of the human. Internal calibration parameters are used to convert the depth view
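
A minimal sketch of the back-projection step described above (not the paper's implementation; the intrinsic parameters and the fixed depth threshold used here as crude background removal are hypothetical):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, max_depth=3.0):
    """Back-project a depth map (in metres) to a 3D point cloud with
    pinhole intrinsics (fx, fy, cx, cy), dropping zero (invalid) pixels
    and everything farther than max_depth as static background."""
    v, u = np.indices(depth.shape)             # pixel row/column grids
    valid = (depth > 0) & (depth < max_depth)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx               # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z])          # (N, 3) partial view

# toy 4x4 depth map: a 2x2 foreground patch one metre from the camera
depth = np.zeros((4, 4))
depth[1:3, 1:3] = 1.0
pts = depth_to_points(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
```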

Feature correspondence for initial registration

Recent advances in feature correspondence have mainly focused on the following topics: feature extraction based on local convolutional auto-encoders [42] and topic models [43], and feature comparison via Hausdorff distance learning [44]. Unlike these studies, our work relies on key point selection on a 3D point cloud, and computes feature descriptors for coarse correspondence between partial 3D views.

Key point selection: Owing to the large number of points in a point cloud,
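
The coarse matching stage can be illustrated with mutual nearest-neighbour search in descriptor space; the toy 2-D descriptors below are hypothetical stand-ins for the real feature descriptors computed at key points:

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Mutual nearest-neighbour matching of two descriptor sets.
    Returns index pairs (i, j) where a[i] and b[j] pick each other."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)           # best b for each a
    nn_ba = d.argmin(axis=0)           # best a for each b
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

desc_a = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc_b = np.array([[1.1, 0.9], [4.9, 5.2], [0.1, -0.1]])
matches = match_descriptors(desc_a, desc_b)   # [(0, 2), (1, 0), (2, 1)]
```

The mutual check discards one-sided matches, which is a common way to suppress spurious correspondences before any geometric verification.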

Fine pairwise registration

To obtain accurate registration of two partial 3D views and significantly reduce the influence of noisy data and outliers, we follow the notion of coherent point drift [57]. A set of points X in one partial 3D view can be modeled using Gaussian mixture models, while another set of points Y is regarded as the data generated by these Gaussian mixture models. The posterior probability of data points in Y corresponding to the Gaussian mixture models centered on points in X should be maximized after
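
A sketch of the E-step under this model, assuming the standard coherent point drift formulation with a uniform outlier component (the point values, variance, and outlier weight below are illustrative):

```python
import numpy as np

def cpd_posteriors(X, Y, sigma2, w=0.1):
    """E-step of CPD-style registration: P[m, n] is the posterior that
    data point Y[n] was generated by the Gaussian centred at X[m];
    w is the weight of a uniform outlier component."""
    M, N = len(X), len(Y)
    D = X.shape[1]
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    g = np.exp(-d2 / (2.0 * sigma2))
    # constant from the uniform outlier term in the CPD mixture
    c = (2.0 * np.pi * sigma2) ** (D / 2.0) * w * M / ((1.0 - w) * N)
    return g / (g.sum(axis=0) + c)

X = np.array([[0.0, 0.0], [5.0, 5.0]])   # GMM centroids (one partial view)
Y = np.array([[0.1, 0.0], [5.0, 4.9]])   # data points (the other view)
P = cpd_posteriors(X, Y, sigma2=1.0)
```

In the full algorithm these posteriors drive the M-step that re-estimates the rigid transform and variance; the outlier term keeps noisy points from dominating the alignment.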

Global registration

While sequentially aligning each 3D partial view pairwise with its neighbor, the error gradually accumulates so that the starting and ending partial views cannot be aligned. We introduce global registration to obtain a consensus registration for all partial views by diffusing pairwise registration errors. From all the pairwise registrations, we iteratively select the view with the most matching pairs. The current view is aligned with all its neighboring views to obtain the new registration
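
The error-diffusion idea can be illustrated with a translation-only toy analogue (rotations omitted for brevity; the pairwise measurements are made up and deliberately inconsistent around the loop):

```python
import numpy as np

def diffuse_offsets(n_views, pairwise, n_iter=100):
    """Toy translation-only analogue of global registration: each view's
    offset is repeatedly re-estimated from all neighbouring views and
    the measured pairwise shifts, spreading the loop-closure error over
    the whole camera ring. View 0 anchors the coordinate frame."""
    t = np.zeros((n_views, 2))
    for _ in range(n_iter):
        for i in range(1, n_views):              # view 0 fixes the gauge
            preds = []
            for (a, b), d in pairwise.items():   # measurement: t[b] ~ t[a] + d
                if b == i:
                    preds.append(t[a] + d)
                elif a == i:
                    preds.append(t[b] - d)
            t[i] = np.mean(preds, axis=0)        # consensus over neighbours
    return t

# three views in a loop; the direct 0->2 measurement disagrees by 0.3
pairwise = {(0, 1): np.array([1.0, 0.0]),
            (1, 2): np.array([1.0, 0.0]),
            (0, 2): np.array([2.3, 0.0])}
t = diffuse_offsets(3, pairwise)
```

Instead of the 0.3 discrepancy landing entirely on the last view, the iteration converges to offsets (1.1 and 2.2) that split the residual among the edges, which is the qualitative behaviour global registration aims for.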

Human mesh reconstruction

The colored human point cloud generated above has three defects, which may cause an incorrect mesh reconstruction. One problem is that the point cloud is not smooth and contains several burrs on the boundary, resulting from interference between the RGB-D cameras, inevitable pairwise and global registration errors, and the background mixture. We solve the organization problem of these unordered 3D points, and adopt a polygonal form to connect these points topologically. Another
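
One ingredient of the Poisson step is a per-point normal, which can be sketched with PCA over local neighbourhoods (a generic technique, not necessarily the paper's exact procedure; the brute-force neighbour search is for clarity only):

```python
import numpy as np

def estimate_normals(pts, k=8):
    """PCA normal estimation: Poisson surface reconstruction needs
    oriented per-point normals; here each normal is the direction of
    least variance among a point's k nearest neighbours."""
    normals = np.zeros_like(pts)
    for i, p in enumerate(pts):
        # brute-force k nearest neighbours of p
        nbrs = pts[np.argsort(np.linalg.norm(pts - p, axis=1))[:k]]
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
        normals[i] = np.linalg.eigh(cov)[1][:, 0]  # smallest-eigenvalue axis
    return normals

# points on the z = 0 plane: every estimated normal should be ±(0, 0, 1)
rng = np.random.default_rng(0)
pts = np.column_stack([rng.random((20, 2)), np.zeros(20)])
n = estimate_normals(pts)
```

A real pipeline would additionally orient the normals consistently (e.g., toward the capturing camera) before feeding them to Poisson reconstruction.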

Experimental results

Experimental environment: To verify the whole procedure, we built a hardware system, which includes six low-cost depth cameras (Kinect), their brackets, a dark blanket, six USB cables connecting the depth cameras to the computer, and a desktop computer with an i7 CPU and 16 GB memory. Fig. 1(a) illustrates the experimental environment, with six depth cameras placed at different locations on a circle to capture different views with as many overlapping points as possible. We do not require that

Conclusion

In this paper, we presented a novel method for reconstructing real 3D humans by employing multiple low-cost depth cameras. The core of the implementation lies in how to reconstruct realistic humans from noisy RGB-D data. The process consists of three key steps: 3D partial view generation, registration of partial views, and human mesh reconstruction. The novel application allows a user to be truly immersed in a virtual reality scene using low-cost RGB-D cameras. This should greatly

Acknowledgments

The work is partially supported by NSFC (Nos. 61003137, 61202185, 61473231, and 91120005), NWPU Basic Research Fund 310201401 (JCQ01009 and JCQ01012), Fund of National Engineering Center for Commercial Aircraft Manufacturing (201410), National Aerospace Science Foundation of China (2014ZD53), and Open Fund of State Key Lab of CAD & CG in Zhejiang University (A15).

References (63)

  • F. Pedersoli, N. Adami, S. Benini, R. Leonardi, XKin: extendable hand pose and gesture recognition library for Kinect,...
  • J. Tong et al.

    Scanning 3D full human bodies using Kinects

    IEEE Trans. Vis. Comput. Graph.

    (2012)
  • D. Anguelov et al.

    SCAPE: shape completion and animation of people

    ACM Trans. Graph.

    (2005)
  • A. Weiss, D. Hirshberg, M.J. Black, Home 3D body scans from noisy image and range data, in: Proceedings of the...
  • Y. Chen, Z. Liu, Z. Zhang, Tensor-based human body modeling, in: Proceedings of the IEEE Conference on Computer Vision...
  • D.S. Alexiadis et al.

    Full 3-D reconstruction of moving foreground objects from multiple consumer depth cameras

    IEEE Trans. Multimed.

    (2013)
  • T. Helten, M. Müller, H.-P. Seidel, C. Theobalt, Real-time body tracking with one depth camera and inertial sensors,...
  • O. Gedik et al.

    3-D rigid body tracking using vision and depth sensors

    IEEE Trans. Cybern.

    (2013)
  • J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose...
  • M. Wang et al.

    Movie2Comics: towards a lively video content presentation

    IEEE Trans. Multimed.

    (2012)
  • B. Leng et al.

    ModelSeek: an effective 3D model retrieval system

    Multimed. Tools Appl.

    (2011)
  • Z. Liu et al.

    A survey on partial retrieval of 3D shapes

    J. Comput. Sci. Technol.

    (2013)
  • Y. Gao et al.

    View-based 3D object retrieval: challenges and approaches

    IEEE MultiMed.

    (2014)
  • A.M. Bronstein et al.

    Shape Google: geometric words and expressions for invariant shape retrieval

    ACM Trans. Graph.

    (2011)
  • B. Leng, X. Zhang, M. Yao, Z. Xiong, 3D object classification using deep belief networks, in: MultiMedia Modeling,...
  • B. Leng, X. Zhang, M. Yao, X. Zhang, A 3D model recognition mechanism based on deep Boltzmann machines, Neurocomputing,...
  • R. Ji, H. Yao, X. Sun, B. Zhong, W. Gao, Towards semantic embedding in visual vocabulary, in: IEEE Conference on...
  • M. Eitz et al.

    Sketch-based shape retrieval

    ACM Trans. Graph.

    (2012)
  • Y. Gao et al.

    Less is more: efficient 3D object retrieval with query view selection

    IEEE Trans. Multimed.

    (2011)
  • Y.-S. Liu et al.

    Computing the inner distances of volumetric models for articulated shape description with a visibility graph

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • Y. Gao et al.

    Camera constraint-free view-based 3D object retrieval

    IEEE Trans. Image Process.

    (2012)