3D real human reconstruction via multiple low-cost depth cameras
Introduction
Real 3D scene and human reconstruction from professional 3D scanners has been studied extensively in multimedia, virtual reality, and computer graphics [1]. Satisfactory results have been obtained that are adaptable to industrial reverse engineering and virtual-reality-based product design. However, these techniques cannot be applied directly in home-centered environments owing to their high cost, large volume, complex operation, and computational burden. In recent years, portable, low-cost, easy-to-use RGB-D cameras such as the Kinect [2] have become very popular and widely used. However, this type of camera captures low-quality depth images, which is a major constraint on providing high-quality immersive virtual applications. Consequently, scene reconstruction and human modeling using RGB-D sensors have attracted great research interest.
There are several pioneering works focusing on interesting 3D virtual applications of RGB-D sensors. Alexiadis et al. [3] built a real-time automatic system for dance performance evaluation using a Kinect RGB-D sensor, and provided visual feedback for beginners in a 3D virtual scene. Liu et al. [4] used depth sensor data from a Kinect to track human movement and estimate the energy players expend while playing in a virtual environment. Bleiweiss et al. [5] proposed a solution that animates in-game avatars using real-time motion capture data, blending actual player movements with predefined animation sequences. Pedersoli et al. [6] provided a Kinect-based framework that enables more natural and intuitive hand-gesture communication between a human and a computer. Tong et al. [7] presented a novel scanning system for capturing different parts of the human body at close range and then reconstructing a whole 3D human body. However, this system requires external calibration and an accurate capture space, which are not available in home applications. The system proposed by Anguelov et al. [8] deforms a data-driven shape completion and animation of people (SCAPE) model to fit the scanned data given a limited set of markers specifying the target shape. Using a single static scan and markers for motion capture, their system constructs a high-quality animated surface model of a moving person with realistic muscle deformation. Similarly, Weiss et al. [9] introduced a SCAPE model with 15 body parts that fits poses of a real human to a 3D model, while Chen et al. [10] adopted training data, including 3D meshes for multiple users with different poses as obtained from SCAPE, to model existing 3D human body meshes. Alexiadis et al. [11] proposed a full 3D reconstruction of moving foreground objects from depth cameras.
Several authors have also focused on other related applications, e.g., tracking a human body [12], [13], recognizing human poses in real time [14], and converting a movie clip into a comic [15] using depth cameras.
However, unlike commercial 3D scanners, which output clean, dense point clouds from which a structured mesh is easily built, the point cloud captured by depth cameras is sparse, discontinuous, and noisy, producing many holes and missing depth values. This stems from the nature of infrared light measurement, which is affected by multiple factors, including scattering, absorption by dark objects, transmission through transparent objects, multiple reflections, and large distances. We propose a pipeline aimed at obtaining a real textured 3D human body from low-cost depth cameras. We first remove the background to obtain a 3D partial view of the human from each depth image with the help of intrinsic and stereo calibration parameters, and successively register neighboring partial views into concatenated partial views. The complete 3D point cloud of the human is globally registered using all the pairwise registrations. A 3D mesh structure incorporating color is then reconstructed using Delaunay triangulation and the Poisson method. To verify the effectiveness, efficiency, and robustness of real 3D human reconstruction using multiple low-cost depth cameras, we built an experimental environment. Six depth cameras (Kinects) were placed at different positions on a circle to capture multiple views with as many overlapping points as possible. Fig. 1 shows the entire process, from setting up the hardware system to the final 3D human reconstruction. The results show that the quality of the reconstructed mesh is satisfactory.
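At its core, registering two partial views means estimating a rigid transform between them. As a minimal, self-contained illustration (not the paper's pipeline, which must additionally recover unknown correspondences from noisy data), the following Kabsch-style sketch recovers a rigid transform between two synthetic point sets with known correspondences:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst (Kabsch).
    Assumes point-to-point correspondences are already known."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])            # guard against reflections
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    return R, t

# synthetic partial view and a rotated/translated copy of it
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])
R, t = rigid_align(src, dst)
print(np.allclose(src @ R.T + t, dst, atol=1e-8))  # → True
```

In the actual pipeline, correspondences are unknown and noisy, which is why the paper builds feature-based initial correspondences followed by probabilistic fine registration rather than applying Kabsch directly.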
How to build correspondences between partial 3D views automatically and effectively is the key to 3D human reconstruction. Shape correspondence, matching, and retrieval of 3D shapes have been extensively investigated in recent years [16], [17], [18], and these remain hot topics in multimedia, computer vision, computer graphics, and computer-aided design. Many researchers have provided content-based matching and retrieval techniques, focusing mainly on local point description and high-level feature extraction [19], [20], [21], [22], topological structure [23], non-rigid shape features [24], sketch-based retrieval [25], view comparison [26], [27], [28], [29], [30], [31], and relevance feedback mechanisms [32]. For example, to match a single-view Kinect scan to high-quality 3D models, Shen et al. [33] proposed recovering the underlying structure of a scanned object by assembling suitable parts obtained from the repository models. Kim et al. [34] presented an efficient method for acquiring 3D indoor environments with variability and repetition through a Kinect sensor. Their work segments a single 3D point cloud scanned using a real-time simultaneous localization and mapping (SLAM) technique, classifies it into plausible objects, recognizes these using primitive fitting and connected component analysis, and extracts their pose parameters. Kakadiaris et al. [35] used a composed deformable model to fit 2D body silhouettes extracted from three mutually orthogonal views. Wang [36] constructed a feature wireframe of a human body from laser-scanned 3D unorganized points using semantic feature extraction, and modeled the symmetric detail mesh surface of the human body. Hsieh et al. [37] presented a novel segmentation algorithm to segment a body posture into different body parts using the deformable triangulation technique. Holte et al. [38] summarized recent approaches for 3D human pose estimation based on 3D features reconstructed from multiple views.
However, our problem is different: we require robust correspondence between noisy partial views without markers, which are scanned by low-cost depth cameras. Moreover, global registration with a small accumulated error distributed among multiple cameras must be automatically implemented in almost real time. We design a registration framework to realize the two objectives simultaneously.
Contributions: The past decade has seen the emergence of new 3D imaging devices and techniques capable of capturing full human body shapes [39], which are used extensively in human engineering, health assessment, protective equipment (car or airplane seats), medical diagnosis, entertainment, the clothing industry, and virtual reality. In this study, we implemented an entire reconstruction pipeline for a real 3D human as a prototype system composed of multiple low-cost depth cameras. Furthermore, the effectiveness, efficiency, and robustness of the system are investigated by applying it to a group of users with different poses. The entire process is completely automatic and does not require human body landmarks or any manual operation on the 3D model, which is important for dynamic virtual reality environments. Our approach is realizable with the following two technical contributions:
1. A solution to 3D partial view generation of humans is proposed. Given color images and noisy depth images captured by a single sensor, our method adopts background removal to extract a coarse depth view of the human, followed by depth smoothing. Partial view filters are employed to obtain a relatively clean 3D partial point cloud.
2. To provide more robust and accurate registration for sparse and noisy point clouds, we introduce an initial pairwise registration method using segment-constrained correspondence. Correspondences are built for key points and uniform sampling points by comparing their feature descriptors. We add a segment constraint to spectral matching, tailored to human point clouds, which significantly improves correspondence accuracy. The proposed scheme provides an initial alignment close to the optimal solution of the fine registration.
Section snippets
3D partial view generation
In this section, we propose a method for extracting 3D partial views of humans from captured RGB-D images to obtain a relatively clean 3D human automatically from cluttered environments. The scheme relies on dealing with noisy low-resolution depth maps by resorting to relatively high-quality color images. The static background in a color image is easily removed and mapped to the depth image to extract a depth view of the human. Internal calibration parameters are used to convert the depth view
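Converting a depth view to a 3D partial point cloud follows pinhole back-projection with the internal calibration parameters. A minimal sketch, where the intrinsics `fx`, `fy`, `cx`, `cy` are assumed to come from calibration and zero depth marks a hole:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) to 3D camera-frame points
    using the pinhole model; zero depth marks missing measurements."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# demo with toy intrinsics: two valid pixels, one hole (zero depth)
depth = np.array([[0.0, 2.0],
                  [1.0, 0.0]])
pts = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

With real Kinect data, the same routine would be applied after background removal and depth smoothing, using the calibrated intrinsics of each sensor.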
Feature correspondence for initial registration
Recent advances in feature correspondence have mainly focused on the following topics: feature extraction based on local convolutional auto-encoders [42] and topic models [43], and feature comparison via Hausdorff distance learning [44]. Differing from these studies, our work relies on key point selection on a 3D point cloud, and computes feature descriptors for coarse correspondence between partial 3D views.
Key point selection: Owing to the large number of points in a point cloud,
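One common uniform-sampling scheme for large point clouds, offered here as a sketch and not necessarily the authors' exact choice, is voxel-grid downsampling, which keeps a single centroid per occupied voxel:

```python
import numpy as np

def voxel_downsample(points, voxel):
    """Uniform sampling: keep one representative point (the centroid)
    per occupied voxel of side `voxel`."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    n = inv.max() + 1
    sums = np.zeros((n, 3))
    counts = np.zeros(n)
    np.add.at(sums, inv, points)   # accumulate points per voxel
    np.add.at(counts, inv, 1)
    return sums / counts[:, None]

# demo: two nearby points collapse to one centroid, the far point survives
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.1, 1.1, 1.1]])
out = voxel_downsample(pts, 1.0)
```

Downsampling this way keeps the candidate set for descriptor comparison small while preserving coverage of the surface.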
Fine pairwise registration
To obtain accurate registration of two partial 3D views and significantly reduce the influence of noisy data and outliers, we follow the notion of coherent point drift [57]. A set of points X in one partial 3D view can be modeled using Gaussian mixture models, while another set of points Y is regarded as the data generated by these Gaussian mixture models. The posterior probability of data points in Y corresponding to the Gaussian mixture models centered on points in X should be maximized after
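The posterior in this formulation has a closed form per E-step. A sketch under coherent point drift's usual uniform-outlier assumption, where the outlier weight `w` is a free parameter:

```python
import numpy as np

def gmm_posterior(X, Y, sigma2, w=0.1):
    """E-step of coherent point drift: posterior P[m, n] that data
    point Y[n] was generated by the Gaussian centred at X[m],
    with a uniform outlier component of weight w."""
    M, N = len(X), len(Y)
    D = X.shape[1]
    d2 = ((Y[None, :, :] - X[:, None, :]) ** 2).sum(-1)   # (M, N)
    num = np.exp(-d2 / (2 * sigma2))
    # constant from the uniform outlier distribution
    c = (2 * np.pi * sigma2) ** (D / 2) * w / (1 - w) * M / N
    return num / (num.sum(0, keepdims=True) + c)

# demo: a data point on top of one centroid gets nearly all its mass
X = np.array([[0.0, 0.0], [5.0, 5.0]])   # mixture centroids (one view)
Y = np.array([[0.0, 0.0]])               # data points (other view)
P = gmm_posterior(X, Y, sigma2=0.1)
```

In the M-step, these posteriors weight the rigid-transform update; iterating E and M steps while shrinking `sigma2` yields the fine registration.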
Global registration
While sequentially aligning each 3D partial view pairwise with its neighbor, the error gradually accumulates so that the starting and ending partial views cannot be aligned. We introduce global registration to obtain a consensus registration for all partial views by diffusing pairwise registration errors. From all the pairwise registrations, we iteratively select the view with the most matching pairs. The current view is aligned with all its neighboring views to obtain the new registration
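The idea of diffusing accumulated error can be illustrated in a drastically simplified, translation-only form (this is not the paper's full scheme, which redistributes rigid-transform errors over matched pairs): composing the six camera-to-camera translations around the circle should return to the start, and any residual drift is spread evenly:

```python
import numpy as np

def diffuse_loop_error(pair_t):
    """Distribute loop-closure error over sequential pairwise
    translations (translation-only illustration of error diffusion).
    pair_t[i] is the estimated translation from view i to view i+1
    (cyclic); composing all of them should return to the origin."""
    pair_t = np.asarray(pair_t, float)
    drift = pair_t.sum(0)                    # accumulated loop error
    return pair_t - drift / len(pair_t)      # spread it evenly

# demo: six noisy pairwise translations around the camera circle
rng = np.random.default_rng(1)
loop = rng.normal(size=(6, 3))
corrected = diffuse_loop_error(loop)
```

After correction the loop closes exactly, so the first and last partial views align; the paper achieves the analogous effect for full rigid transforms by iteratively re-aligning the view with the most matching pairs against all its neighbors.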
Human mesh reconstruction
The colored human point cloud generated above has three defects, which may cause an incorrect mesh reconstruction. One problem is that the point cloud is not smooth and contains burrs on the boundary, resulting from interference between the RGB-D cameras, inevitable pairwise and global registration errors, and background mixed into the foreground. We solve the organization problem of these unordered 3D points, and adopt a polygonal form to connect them topologically. Another
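Trimming such boundary burrs before meshing is commonly done with a statistical outlier filter. A brute-force sketch of that standard technique (not necessarily the authors' exact filter; `k` and `std_ratio` are tuning assumptions, and a KD-tree would replace the O(n²) distance matrix in practice):

```python
import numpy as np

def remove_outliers(points, k=8, std_ratio=2.0):
    """Statistical outlier removal: drop points whose mean distance
    to their k nearest neighbours is far above the global average."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # ignore self-distance
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k])
    mean_d = knn.mean(1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep]

# demo: a dense cluster plus one far-away "burr" point
rng = np.random.default_rng(2)
cloud = rng.uniform(size=(20, 3))
noisy = np.vstack([cloud, [[100.0, 100.0, 100.0]]])
clean = remove_outliers(noisy, k=5)
```

Filtering before Delaunay triangulation and Poisson reconstruction prevents stray points from pulling the fitted surface away from the body.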
Experimental results
Experimental environment: To verify the whole procedure, we built a hardware system, which includes six low-cost depth cameras (Kinect), their brackets, a dark blanket, six USB cables connecting the depth cameras to the computer, and a desktop computer with an i7 CPU and 16 GB memory. Fig. 1(a) illustrates the experimental environment, with six depth cameras placed at different locations on a circle to capture different views with as many overlapping points as possible. We do not require that
Conclusion
In this paper, we presented a novel approach to reconstructing real 3D humans by employing multiple depth cameras. The core of the implementation lies in reconstructing realistic humans from noisy RGB-D data. The process consists of three key steps: 3D partial view generation, registration of partial views, and human mesh reconstruction. The novel application allows a user to be truly immersed in a virtual reality scene using low-cost RGB-D cameras. This should greatly
Acknowledgments
The work is partially supported by NSFC (Nos. 61003137, 61202185, 61473231, and 91120005), NWPU Basic Research Fund 310201401 (JCQ01009 and JCQ01012), Fund of National Engineering Center for Commercial Aircraft Manufacturing (201410), National Aerospace Science Foundation of China (2014ZD53), and Open Fund of State Key Lab of CAD & CG in Zhejiang University (A15).
References (63)
- et al., 3D shape retrieval using kernels on extended Reeb graphs, Pattern Recognit. (2013)
- et al., A comparison of methods for non-rigid 3D shape retrieval, Pattern Recognit. (2013)
- Parameterization and parametric design of mannequins, Comput. Aided Des. (2005)
- et al., 3D model comparison using spatial structure circular descriptor, Pattern Recognit. (2010)
- et al., DeWall: a fast divide and conquer Delaunay triangulation algorithm in E^d, Comput. Aided Des. (1998)
- et al., A benchmark for surface reconstruction, ACM Trans. Graph. (2013)
- Microsoft Kinect, 〈http://www.xbox.com/kinect〉, ...
- D.S. Alexiadis, P. Kelly, P. Daras, N.E. O'Connor, T. Boubekeur, M.B. Moussa, Evaluating a dancer's performance using...
- Z. Liu, S. Tang, H. Qin, S. Bu, Evaluating user's energy consumption using kinect based skeleton tracking, in: ...
- A. Bleiweiss, D. Eshar, G. Kutliroff, A. Lerner, Y. Oshrat, Y. Yanai, Enhanced interactive gaming by blending full-body...
- Scanning 3D full human bodies using kinects, IEEE Trans. Vis. Comput. Graph.
- SCAPE: shape completion and animation of people, ACM Trans. Graph.
- Full 3-D reconstruction of moving foreground objects from multiple consumer depth cameras, IEEE Trans. Multimed.
- 3-D rigid body tracking using vision and depth sensors, IEEE Trans. Cybern.
- Movie2Comics: towards a lively video content presentation, IEEE Trans. Multimed.
- ModelSeek: an effective 3D model retrieval system, Multimed. Tools Appl.
- A survey on partial retrieval of 3D shapes, J. Comput. Sci. Technol.
- View-based 3D object retrieval: challenges and approaches, IEEE MultiMed.
- Shape Google: geometric words and expressions for invariant shape retrieval, ACM Trans. Graph.
- Sketch-based shape retrieval, ACM Trans. Graph.
- Less is more: efficient 3D object retrieval with query view selection, IEEE Trans. Multimed.
- Computing the inner distances of volumetric models for articulated shape description with a visibility graph, IEEE Trans. Pattern Anal. Mach. Intell.
- Camera constraint-free view-based 3D object retrieval, IEEE Trans. Image Process.