1 Introduction

A 3D model of the human body is required in many applications, such as video games, e-commerce, virtual reality and biomedical research. It is therefore important to have robust and accurate methods for recovering models of humans from one or several RGB images. This is, however, a difficult problem due to non-rigid motion, varying clothing and complex articulation, which makes 3D body reconstruction a challenging and interesting task in computer vision.

Aiming at acquiring a realistic and personalized 3D human body effectively, several methods have been proposed during the past decades, many relying on expensive active reconstruction equipment or on improving reconstruction algorithms based on structure from motion. 3D scanners or multiple calibrated cameras in a controlled environment can produce 3D models with very high accuracy [1]. The disadvantage of such methods is that these systems are very expensive and relatively complicated to build.

Besides these scanning systems, another line of research obtains 3D models from images acquired by ordinary cameras or depth sensors, using stereo reconstruction or fusion algorithms [5,6,7]. These methods do not require expensive equipment or complicated set-ups, but instead rely on computationally expensive computer vision algorithms. Structure from motion (SfM) can reconstruct a 3D model of a person in a static pose from a moving camera. Using depth sensors, for instance the Kinect, one can also obtain a 3D model by fusing the geometry observed from different view-points. These methods do not require any prior information, such as a model of the human shape.

Although these ideas have led to considerable progress, there is still a need for simpler methods to reconstruct a 3D human body model. With the remarkable progress of human pose estimation based on deep neural networks (DNNs), estimated poses have been shown to provide useful information for the reconstruction. Therefore, methods based on strong prior information have been proposed that reconstruct 3D models with good performance. These methods estimate a 3D human body model from a single monocular RGB image by fitting a statistical human body model to the human pose predicted by a DNN [22, 23]. However, one image is in many cases not sufficient for a sufficiently accurate 3D reconstruction, due to self-occlusion and complicated articulated motion.

In this paper we propose to use several (e.g. a sequence of) RGB images acquired from different viewpoints to reconstruct the 3D human body, based on the skinned multi-person linear model (SMPL) [2]. We construct an energy function that measures the difference between the 2D joint points of the RGB images and the 2D joint points of the projected SMPL model. The 2D joint points of the RGB images are predicted by OpenPose [3]. The difference between our method and SfM-based methods is that we only use the estimated joint positions to reconstruct the 3D model. At the same time, the camera orientations are treated as additional parameters when the energy function is minimized. The advantage is that several images from different viewpoints provide more accurate 3D information, while the number of images used in our method is in general smaller than for SfM-based methods. Experiments on synthetic data and Human3.6M [4] show that our method obtains more accurate pose estimates and 3D shapes than similar methods based on a single image.

2 Related Work

As in [25], related work can roughly be divided into two categories: methods that do not use parametric models and methods based on parametric models.

Non-parametric methods typically reconstruct 3D models from images acquired by a camera from different viewpoints or from the fusion of depth sensor data. Accurate results can be obtained without any strong prior information, but the person has to stand still during capture and the computation is complex and time-consuming. The most well-known algorithm is KinectFusion [5], which creates 3D models in real time by incrementally fusing partial scans from a moving RGB-D sensor. It performs well for rigid objects, but is not designed for articulated motion. For the 3D reconstruction of a static person, several approaches [6, 7] inspired by KinectFusion have therefore been proposed. These methods cannot achieve satisfying results for a moving person, since the human body typically deforms non-rigidly between different views. DynamicFusion [8], the pioneering work on reconstruction of non-rigid objects, can reconstruct 3D geometry in real time for a slowly moving person. Other methods, such as KillingFusion [9] and BodyFusion [12], improve on DynamicFusion. However, these approaches are only suitable for slow motion and have high computational complexity. To obtain more accurate results, multiple Kinect sensors or several calibrated cameras can be used to create 3D human body models. In [10], the authors use eight Kinects to obtain a 3D model with high accuracy. Multiple cameras are also used in [1, 11] to reconstruct the 3D human body. However, building a system with eight Kinects, or an indoor environment like the one in [1], is technically challenging and too expensive for many practical applications.

Parametric model-based methods rely on a template which provides strong prior information during the reconstruction. The template can be reconstructed from depth data or taken from a pre-computed human body model. In [13,14,15], non-rigid registration algorithms are proposed to register a pre-scanned model to partial depth data acquired by a Kinect. In [14], a template is obtained by registering several high-quality partial scans, and a personalized 3D model is then reconstructed by fitting to this template. Other algorithms [16, 17] follow similar ideas but use more elaborate information or hardware to improve accuracy and efficiency. Besides pre-scanned templates, a number of statistical human body models trained on registered human body scans have been proposed, such as SCAPE [18] and SMPL [2] (see also [19]). In [20] the authors fit the SCAPE model to a depth image to obtain a 3D model. Delta, an improved SCAPE model, is proposed in [21] together with a detailed body reconstruction algorithm. In [22], the authors propose to fit the SMPL model using 2D joint points predicted by a DNN-based method. Huang et al. [23] use a similar idea but focus on video, exploiting temporal information. In [24] an end-to-end adversarial learning method is used to estimate the human pose and shape parameters of the SMPL model. Alldieck et al. [25] propose an algorithm that first computes a consensus shape and then uses both pose and consensus shape to fit the SMPL model, obtaining better results.

Fig. 1. The overview of our method.

3 Method

Our aim is to obtain a 3D model of a human body from several RGB images taken from different view-points. Our approach is inspired by the work in [22], where the 3D human body model is estimated from a single RGB image. Although the method in [22] achieves reasonable accuracy, the error is still noticeable in many cases, since one RGB image cannot supply enough information. As an improvement, we propose to use several RGB images taken from different view-points to reconstruct the 3D model. This leads to a more challenging optimization problem, since the motion of the cameras is unknown, and we need to introduce the camera parameters as variables to estimate. First, we estimate the positions of the 2D joint points of the person in the images using OpenPose. Then, the SMPL model is fitted to the pose of the person in the different views by optimizing an energy function in which the camera parameters are included. Finally, the pose, shape and camera parameters are estimated to obtain the 3D model of the human. The pipeline of our method is summarized in Fig. 1. In the following, we first introduce the SMPL model, then the energy function, and finally the optimization that yields the estimates of the camera parameters as well as the pose and shape parameters of the 3D human model.
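
As a concrete illustration of the first step, a minimal sketch for reading the OpenPose detections is given below. It assumes the 'pose_keypoints_2d' JSON layout of recent OpenPose releases and simply takes the first detected person; both the file layout and the helper name are assumptions made for illustration, not details fixed by our method.

```python
import json
import numpy as np

def load_openpose_keypoints(path):
    # Assumed layout of recent OpenPose JSON output: a flat list of
    # (x, y, confidence) triples per person under 'pose_keypoints_2d'.
    with open(path) as f:
        data = json.load(f)
    kp = np.array(data['people'][0]['pose_keypoints_2d']).reshape(-1, 3)
    return kp[:, :2], kp[:, 2]  # (K, 2) joint positions, (K,) confidences
```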

3.1 SMPL Model

The SMPL model encodes both pose and shape parameters [22]. The pose is given by the parameter \(\varvec{\theta }\), which represents the relative rotations of the 23 body joints in the kinematic tree together with the global rotation of the root joint. The shape is represented by the parameter \(\varvec{\beta }\), which describes the strength of each mode of a shape space obtained by principal component analysis (PCA) of a registered training set. The pose parameters are represented as a vector \(\varvec{\theta }\in {\mathbb {R}^{72}}\) and the shape parameters as a vector \(\varvec{\beta }\in {\mathbb {R}^{10}}\).

The output of the SMPL model, for given pose and shape parameters, is a mesh with \(N=6890\) vertices and \(F=13776\) faces, \(M(\varvec{\theta },\varvec{\beta })\in \mathbb {R}^{N\times 3}\). In this model, the 3D joints are obtained by linear regression from the surface mesh vertices and are thus a function of the pose and shape coefficients. Therefore, the pose and shape parameters can be estimated by optimizing an energy function based on the joint points.
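
Since the 3D joints are a linear function of the mesh vertices, the regression step can be written in a few lines. The following sketch uses random placeholder data instead of the learned SMPL regressor, so the array contents are hypothetical:

```python
import numpy as np

N_VERTS, N_JOINTS = 6890, 24

# Placeholder stand-ins for the posed SMPL mesh M(theta, beta) and the
# learned joint regressor shipped with the SMPL model.
vertices = np.random.rand(N_VERTS, 3)
J_regressor = np.random.rand(N_JOINTS, N_VERTS)
J_regressor /= J_regressor.sum(axis=1, keepdims=True)  # rows sum to one

# Each 3D joint is a fixed linear combination of mesh vertices, so the
# joints are differentiable in the pose and shape parameters.
joints_3d = J_regressor @ vertices  # shape (24, 3)
```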

3.2 Pose and Shape Fitting

The approach in [22] is called SMPLify; it fits the projection of the 3D joints of the SMPL model to 2D joint points predicted by a CNN-based method. The advantage of this method is that only one image is needed to obtain a 3D model. However, a disadvantage of SMPLify is that in some situations one image does not contain enough information for an accurate 3D reconstruction (due to self-occlusion, articulated motion and ambiguous poses). Other methods, based on traditional SfM pipelines, require a large number of images from different views and are computationally intensive. Therefore, we propose to extend SMPLify to several images from different views: more images provide stronger constraints, while the number of images remains small. The difficulty with this idea is that the parameters of the cameras for the different views are unknown, which makes the projection of the joint points of the SMPL model difficult. Our solution is to treat the camera parameters, together with the pose and shape of the SMPL model, as variables of the energy function during the optimization. In this way we obtain not only an estimate of the pose and shape but also an estimate of the camera parameters (position and orientation).

The energy function contains three parts: the pose-fitting term, the pose parameter regularization term and the shape parameter regularization term. We define the energy function as:

$$\begin{aligned} E(\varvec{\theta },\varvec{\beta },R_i) = E_{J}(\varvec{\theta },\varvec{\beta },R_i)+\lambda _{\theta }E_{\varvec{\theta }}(\varvec{\theta })+\lambda _{\beta }E_{\varvec{\beta }}(\varvec{\beta }), \end{aligned}$$
(1)

where \(E_{J}(\varvec{\theta },\varvec{\beta },R_i)\) is the pose-fitting term, \(E_{\varvec{\theta }}(\varvec{\theta })\) is the pose parameter regularization term, \(E_{\varvec{\beta }}(\varvec{\beta })\) is the shape parameter regularization term and \(\lambda _{\theta }\) and \(\lambda _{\beta }\) are weights. From the energy function, the pose \(\varvec{\theta }\), the shape \(\varvec{\beta }\) and the camera rotations \(R_i\) can be estimated through

$$\begin{aligned} \hat{\varvec{\theta }},\hat{\varvec{\beta }},\hat{R}_i=\mathop {\arg \min }\limits _{\varvec{\theta },\varvec{\beta },R_i}E(\varvec{\theta },\varvec{\beta },R_i). \end{aligned}$$
(2)

The most important term is \(E_{J}\) in our method and it is defined as

$$\begin{aligned} E_{J}(\varvec{\theta },\varvec{\beta },R_i)=\sum _{i=1}^{N}\sum _{k=1}^{K}\rho (\varPi _i(J_{S,k})-J_{2d,k}^{(i)}), \end{aligned}$$
(3)

where N is the number of images, K is the number of joint points, \(J_{S,k}\) is the k-th 3D joint point of the SMPL model, \(\varPi _i\) is the projection of the i-th camera, \(J_{2d,k}^{(i)}\) is the k-th 2D joint point estimated by OpenPose in the i-th image and \(R_i\) is the rotation of the i-th camera. The error \(\rho \) is measured by the Geman-McClure function [26], which gives robustness to large noise and outliers. This function is defined as

$$\begin{aligned} \rho (x)=\frac{x^2}{\sigma ^2+x^2}, \end{aligned}$$
(4)

where x is the residual of the 2D joint points and \(\sigma \) is a constant. The projection of the 3D joint points of the SMPL model into the i-th camera is

$$\varPi _i(J_{S})=R_iJ_{S}+t_i,$$

where \(t_i\) is the translation of the i-th camera. The translation is calculated separately from the shoulders and hips, under the assumption that the person is standing roughly parallel to the image plane. Because the projection is linear, the derivatives of the error function are easy to compute during the optimization.
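
A compact sketch of the data term (3) under this linear camera model might look as follows. Dropping the z-coordinate after the rigid transform is an assumption made for the sketch; the paper only states that the camera model is linear.

```python
import numpy as np

def geman_mcclure(res, sigma=100.0):
    # Robust error of Eq. (4), applied to per-joint 2D residuals.
    sq = np.sum(res ** 2, axis=-1)
    return sq / (sigma ** 2 + sq)

def data_term(joints_3d, joints_2d, rotations, translations):
    # joints_3d:  (K, 3)  SMPL joints J_S
    # joints_2d:  (N, K, 2)  OpenPose detections, one set per image
    # rotations:  (N, 3, 3), translations: (N, 3)
    E = 0.0
    for R, t, j2d in zip(rotations, translations, joints_2d):
        cam = joints_3d @ R.T + t   # Pi_i(J_S) = R_i J_S + t_i
        proj = cam[:, :2]           # assumed linear drop of the z-coordinate
        E += geman_mcclure(proj - j2d).sum()
    return E
```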

The pose regularization term is needed to avoid the knees and elbows bending unnaturally, and it is defined as

$$\begin{aligned} E_{\varvec{\theta }}(\varvec{\theta })=\alpha \sum \limits _i \exp (\theta _i), \end{aligned}$$
(5)

where \(\theta _i\) denotes the rotation angles of the elbow and knee joints and \(\alpha \) is a constant that controls the penalization. The shape regularization term is defined as

$$\begin{aligned} E_{\varvec{\beta }}(\varvec{\beta })=\sum \limits _i \beta _i, \end{aligned}$$
(6)

i.e. as the sum of the elements of \(\varvec{\beta }\).
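
Continuing the data-term sketch above, the two regularizers and the total energy (1) are straightforward to assemble. The index set for the elbow and knee angles below is a placeholder, since the paper does not list the exact entries of \(\varvec{\theta }\) that are penalized:

```python
import numpy as np

BEND_IDX = [55, 58, 12, 15]  # hypothetical indices of elbow/knee angles

def pose_reg(theta, alpha=10.0):
    # E_theta of Eq. (5): exponential penalty on unnatural bending.
    return alpha * np.sum(np.exp(theta[BEND_IDX]))

def shape_reg(beta):
    # E_beta of Eq. (6): sum of the shape coefficients.
    return np.sum(beta)

def total_energy(theta, beta, rotations, translations,
                 joints_3d, joints_2d, lam_theta, lam_beta):
    # Full energy of Eq. (1); data_term is the sketch given above. In a
    # full implementation, joints_3d would be regenerated from
    # (theta, beta) by the SMPL model at every evaluation.
    return (data_term(joints_3d, joints_2d, rotations, translations)
            + lam_theta * pose_reg(theta) + lam_beta * shape_reg(beta))
```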

3.3 Optimization

The optimization is performed in two steps. In the first step the camera translations are estimated. Here the focal length of the camera is assumed to be known. The camera translation is estimated by fitting the shoulders and hips of the SMPL model to the predicted 2D pose.
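
Under the stated assumption that the person stands roughly parallel to the image plane, the depth component of the translation can be initialised by similar triangles from the torso size. The following is a sketch of one plausible way to implement this step, with hypothetical joint indices, not a transcription of our code:

```python
import numpy as np

def init_translation(joints_3d, joints_2d, focal,
                     shoulder=2, hip=9):  # hypothetical joint indices
    # Similar triangles: depth ~ focal * (3D torso size / 2D torso size),
    # valid when the torso is roughly parallel to the image plane.
    d3 = np.linalg.norm(joints_3d[shoulder] - joints_3d[hip])
    d2 = np.linalg.norm(joints_2d[shoulder] - joints_2d[hip])
    return np.array([0.0, 0.0, focal * d3 / d2])
```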

In the second step the model is fitted by minimizing the full energy (1). The weights \(\lambda _{\theta }\) and \(\lambda _{\beta }\) are decreased gradually during the optimization. The minimization is based on Powell's dogleg method, as provided by the Python modules OpenDR [27] and Chumpy [28]. For four different views (image size \(320\times 240\)), the minimization takes about 2 min on a desktop machine.
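
The staged minimization can be sketched as follows. SciPy with numerical gradients is used here as a stand-in for the Powell dogleg solver of Chumpy/OpenDR; the weight schedule is the one reported in Sect. 4, and the packing of \(\varvec{\theta }\), \(\varvec{\beta }\) and the camera rotations into a single vector x is left abstract.

```python
from scipy.optimize import minimize

# Weight schedule (lambda_theta, lambda_beta) from Sect. 4.
SCHEDULE = [(404, 100), (404, 50), (58, 5), (4.78, 1)]

def fit(energy, x0):
    # `energy(x, lam_theta, lam_beta)` evaluates Eq. (1) for a parameter
    # vector x packing theta, beta and the camera rotations.
    x = x0
    for lam_t, lam_b in SCHEDULE:
        res = minimize(lambda v: energy(v, lam_t, lam_b), x,
                       method='BFGS',  # stand-in for Powell's dogleg
                       options={'maxiter': 100})
        x = res.x  # each stage starts from the previous solution
    return x
```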

4 Experiment

In this section, experiments are presented to illustrate the performance of our method. In the first experiment, we generate a small synthetic dataset based on SURREAL [29], in which a large number of synthetic human bodies with different poses and shapes are created from the SMPL model. Since SURREAL only provides videos from one view, we generate three more images from other views. For real images, our method is evaluated on Human3.6M, a standard benchmark for human pose estimation.

In our experiments, the weights \((\lambda _{\theta },\lambda _{\beta })\) decrease over four stages as (404, 100), (404, 50), (58, 5) and (4.78, 1). The constant \(\sigma \) is set to 100 and \(\alpha \) to 10. The maximum number of iterations is 100 for every stage and the stopping criterion is that the error of the energy function is smaller than \(10^{-3}\). The experiments are implemented in Python and our desktop machine has a 4-core Intel i5-6500 CPU @ 3.20 GHz with 8 GB RAM.

4.1 Results on Synthetic Data

The synthetic images are generated based on SURREAL. We feed 100 pose and shape parameter sets from the training data of SURREAL into the SMPL model to generate 100 different 3D human bodies. Then, for each human body model, four images of size \(320\times 240\) are rendered by cameras at different view-points. At the same time, the joint points of the human body in each image are computed. (We will provide this small dataset online.) With our method, the 3D models are estimated from two, three and four views, respectively. For comparison, the 3D models obtained by SMPLify are also given. In order to compare the results quantitatively, the evaluation metric is defined as:

$$\begin{aligned} Error=\frac{1}{N}\sum _{i=1}^{N}||J_{i}^{gt}-J_{i}^{est}||_2, \end{aligned}$$
(7)

where \(J_{i}^{gt}\) is the ground truth of the i-th 3D joint point, \(J_{i}^{est}\) is the estimated 3D joint point and N is the number of joint points. In this experiment there are in total 24 joint points in the SMPL model.
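
For reference, the metric (7) amounts to the mean Euclidean distance over the joints and can be computed directly:

```python
import numpy as np

def mean_joint_error(j_gt, j_est):
    # Eq. (7): mean per-joint Euclidean error over the N joint points.
    # j_gt, j_est: (N, 3) ground-truth and estimated 3D joints.
    return np.mean(np.linalg.norm(j_gt - j_est, axis=1))
```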

Fig. 2. The errors of the 100 samples and the mean error. Left: the errors of the 100 samples. Right: the mean error over the 100 samples for SMPLify and for our method with two, three and four images.

Table 1. The mean errors of SMPLify compared with our method using 2, 3 and 4 images, respectively.

The errors for the 100 samples and the mean errors for different numbers of images are given in Fig. 2. For most samples, the error is smaller when more images are used. In some cases, the error of our method with two or three images is larger. This is because images from two different views may influence each other, since a camera at a side position cannot capture all of the joint points. The mean error over the 100 samples is also given in Table 1. We can see that the mean error decreases when more images are utilized and that our method outperforms SMPLify, which shows that more images indeed provide more useful information. Figure 3 shows some qualitative results; the layout of each row is described in the caption. Comparing the SMPLify result with ours in the first column, our method performs better, especially for the last example. The images from the other views show that the camera estimates of our method are accurate, which demonstrates the effectiveness of our method.

Fig. 3. Results on synthetic data. Each row corresponds to one person with unknown shape and pose. For each row, the left image in each column is the input image for frames one to four. The middle image in column one is the result of SMPLify using only the input image from frame one. The right image in each column is the result of our method using all four input frames.

4.2 Results on Human3.6M

There are in total 11 subjects (6 males, 5 females) in Human3.6M and every subject performs 15 actions. To test our method sufficiently, we choose S1 (female) and S6 (male) and evaluate on 8 actions: Directions, Discussion, Eating, Greeting, Phoning, Posing, Purchasing and Sitting. For each action, we sample the video every five frames and take in total 100 frames. The results of SMPLify and of our method with four images are compared. The metric is again computed according to Eq. (7), now with 16 joint points, since Human3.6M provides 16 joints. The errors of every frame of the different actions for S1 and S6 are shown in Figs. 4 and 5. The mean errors over the 100 frames of each action for S1 and S6 are shown in Tables 2 and 3. These results show that our method obtains a more accurate estimation in most cases.

In addition, some images from the dataset are shown in Fig. 6. We can see from Fig. 6 that SMPLify makes obvious errors, for instance where arms and body occlude each other. Our method sometimes also has unexpected errors caused by side views, as in the fifth sample. The reason is that our method relies on all of the observed images: if one camera translation is estimated incorrectly, it affects the fit to the images from the other views, and the final result after optimization may be wrong. SMPLify, in contrast, only uses one image, and if this image is not captured from a side view, its result is sometimes better than ours. In general, however, our method achieves better estimates than SMPLify.

Fig. 4. The errors of every frame of the eight actions for S1.

Fig. 5. The errors of every frame of the eight actions for S6.

Table 2. The mean errors of the eight actions of S1.
Table 3. The mean errors of the eight actions of S6.
Fig. 6. Some samples from S1 and S6. For each sample, the input image is followed by the results of SMPLify and our method, from left to right.

5 Conclusion

We have proposed a method to reconstruct a 3D human body model from several RGB images taken from different view-points. Our approach starts by estimating the 2D joint points in the images using the DNN-based method OpenPose. Then, the statistical human body model SMPL is fitted to the predicted 2D joint points by minimizing an energy function over all images simultaneously. In this way, our method estimates both the pose and shape parameters of the human body and the camera parameters. Experiments on synthetic and real data demonstrate, quantitatively and qualitatively, that our method yields lower pose errors than a previous method based on a single image.

Our method also has some limitations. If the images are captured from a side view, the joint points can be very close to each other or even coincide, which makes our method unstable. Moreover, we mainly focus on the estimation of the pose, which implies that the shape of the reconstruction is less accurate. This is, however, a fundamental limitation of all methods that only use the joint positions and disregard the contours of the body.