1 Introduction

Estimating human body shape from imagery is an important problem in computer vision with diverse applications. The estimated body shape provides an accurate proxy geometry for further tasks such as rendering free viewpoint videos [10, 48, 49, 53], surveillance [11], tracking [16], biometric authentication, medical and personal measurements, virtual cloth fitting [17, 36, 40, 51], and artistic image reshaping [56]. Pose estimation is also tightly coupled with shape estimation. Knowing the body shape significantly reduces the complexity and improves the robustness of pose estimation algorithms and thus expands the space of poses that can be reliably estimated [2, 55].

However, in contrast to pose estimation, body shape estimation has received substantially less attention from the community. Most existing algorithms rely on either manual input [25, 40, 56], restrictive assumptions on the acquired images [6], or require information other than just 2D images (e.g. depth) [23, 37, 50]. Furthermore, some of the methods have prohibitive complexity for real-time applications [6, 25, 50]. For practical applications, it is essential to have an automatic and fast algorithm that can work with images acquired under less restrictive conditions and body poses.

In this paper, we propose a fast and automatic method for estimating the 3D body shape of a person from images, utilizing multi-view semi-supervised learning. Our method relies on novel features extracted from a silhouette of a single person under minimal self-occlusion, as in a selfie, and on a parametric human body shape model [3]. The latter is used to generate meshes spanning a spectrum of human body shapes, from which silhouettes are computed over multiple views, in poses compliant with the target applications, for training. We first estimate the viewing direction with high accuracy by solving a classification task. Utilizing the information simultaneously captured in multiple synthetic views of the same body mesh, we apply Canonical Correlation Analysis (CCA) [24] to learn informative bases onto which the extracted features are projected. A random forest regressor is then trained to map the projected feature space to the parameter space. This lowers the feature dimensionality, drastically reducing training and test times, and improves prediction compared to plain regression forests. We demonstrate our results on real people and in a free-viewpoint video (supplementary [1]), and comprehensively evaluate our method by validating it on thousands of body shapes.

Contributions. In summary, the contributions of the paper are: (1) a fast and automatic system for shape estimation from one or more monocular silhouettes, without assuming a fixed pose or a known camera, thanks to novel features that robustly capture global and local information simultaneously, (2) a demonstration of how CCA multi-view learning with regression forests can be applied to shape estimation, leveraging synthetic data and improving prediction over random forests trained on raw data, (3) extensive validation on thousands of body shapes via thorough comparisons to the state-of-the-art on a new, larger dataset.

2 Related Work

General methods for shape estimation. Estimating 3D geometry of body shapes from limited imagery is an inherently ill-posed problem. Early methods used simplifying assumptions such as the visual hull [30] or simple body models with geometric primitives [15, 27, 34]. Although these work well for coarse pose and shape approximations, they cannot provide accurate shape estimates.

Human body shape statistical priors. Instead of assuming general geometry, human body shape model based methods rely on the limited degrees of freedom for the possible body shapes. These parametric models are typically constructed from collected 3D scans of people [3, 22, 39]. Utilizing such a prior allows us to always stay within the space of realistic body shapes, and reduces the problem to estimating the parameters of the model. Such models can also be combined with articulation models to simultaneously represent pose as joint angles or transformations, and shape with parameters [3, 22, 35]. In this paper, we combine state-of-the-art 3D body shape databases [38, 54] containing thousands of meshes, and utilize a deformation model based on SCAPE [3].

Fitting body shapes by silhouette matching. Once a statistical model of 3D shapes is defined, an error metric between the input silhouettes and those of the projections of the parameterized 3D body shape can be minimized [4, 9, 11, 18, 21, 25, 56]. Although this can lead to accurate matching, and despite promising results on deformable 2D shape matching [42, 43], establishing correspondences between the input and output silhouettes is a very challenging problem, especially when the body pose is unknown or self-occlusions are present. The simultaneous estimation of pose and shape is currently addressed by manual interaction to establish and refine matching or pose estimation [11, 25, 56], and under certain assumptions on the error metric, camera calibration, and views [9, 18, 25]. A recent work [29] aims at automatically finding a correspondence between 2D and 3D deformable objects by casting it as an energy minimization problem, demonstrating good results, albeit for a shape retrieval task. Instead of fitting silhouettes directly and locally, we consider a global mapping from silhouettes to shape parameters that is invariant to various poses under mild self-occlusion assumptions. This allows us to sidestep pose estimation, avoid any manual interaction, and estimate the shape parameters for imperfect silhouettes interactively, all of which are essential components of a practical shape estimation system.

Fitting body shapes by statistical models. A recent body of work relies on global mappings between silhouettes and 3D body shapes [6, 12–14, 46, 52]. These methods rely on a statistical model for body shapes as well as silhouettes. In this case, the problem reduces to estimating this mapping by various linear [52] or more complex techniques, such as the shared Gaussian process latent variable model [12]. In order to generate robust and accurate shapes, these techniques typically require pre-defined and accurate poses [6, 12, 52], and have been validated with limited measurements, except for the recent work of Boisvert et al. [6]. The running times can also be prohibitive for real-time applications [6, 13]. We also define a mapping from the silhouette to a statistical shape space. However, we aim at robustness to pose changes and silhouette noise by computing specialized features, projecting them into correlated spaces, and training a regressor with random forests. This allows close to real-time performance, unlocking further applications. We further present an extensive evaluation of our shape estimation with thousands of test cases and tens of body measurements.

Multi-view learning. Canonical Correlation Analysis (CCA) [24] and Kernel-CCA [20] are statistical learning techniques that find maximally correlated linear and non-linear projections of two random vectors. The projected spaces learn representations of the two data views such that each view's predictive ability is mutually maximized. Hence, information present in either view that is uncorrelated with the other view is automatically removed in the projected space, which is a helpful property in predictive tasks. These methods have been used for unsupervised data analysis with multiple views [19], fusing learned features for better prediction [41], reducing sample complexity using unlabeled data [26], or when multiple views are hallucinated from a single view [33]. A generalized version of CCA [45] has also been proposed, but for a classification and retrieval task. Despite its power, CCA in combination with regression has found little usage since its proposal [26]. It has only been empirically evaluated for linear regression [33], and utilized for an action recognition classification task [28]. To the best of our knowledge, we are the first to apply CCA in a non-linear regression task for shape estimation, illustrating its power for such non-linear problems.

3 Shape Estimation Algorithm

Fig. 1. Overview of our system. Training: Silhouettes from 36 views are extracted from meshes generated in various shapes and poses (Sect. 3.2). A View Classifier is learned (Sect. 3.4) from extracted silhouette features (Sect. 3.3). View specific Regression Forests are then trained to estimate shape parameters by first projecting features in CCA correlated spaces (Sect. 3.5). Testing: The extracted features from an input silhouette are used to first infer the camera view, and then the shape parameters by projecting them into CCA spaces and feeding them into the corresponding Regression Forest.

3.1 Method Overview

The goal of our system is to infer the 3D body shape of a person from a single or multiple monocular images, quickly and automatically. Specifically, we would like to estimate the parameters of a 3D body shape model (Sect. 3.2) such that the corresponding body shape best approximates the 3D body of the subject depicted in the input images. Despite the inherent ambiguity of a 2D silhouette, the projection of the estimated mesh into the image should at least explain it as well as possible.

An overview of our system is depicted in Fig. 1. The input to the shape estimation algorithm is a 2D silhouette of the desired individual under minimal self-occlusion (e.g. a selfie), which can be computed accurately for our target scenarios by learning a background model through Gaussian mixture models and using Graphcuts [7]. The word “selfie” here is used interchangeably to describe the activity of taking a selfie in front of a mirror, and also as a label for poses representing mild self-occlusion (Fig. 2). We then compute features extracted from the silhouettes (Sect. 3.3). These are first used to train a classifier on the camera viewing direction (Sect. 3.4). The features from silhouettes of a particular view are then projected onto bases obtained by CCA, such that the view itself and the most orthogonal one to it (e.g. front and side) are used to capture complementary information in the CCA correlated space, and fed to a Random Forest Regressor (Sect. 3.5) trained for each camera view. At test time, the extracted features from an input silhouette are used to first infer the camera view, and then the shape parameters by projecting them into CCA spaces and feeding them into the corresponding Regression Forest. The parameters are used to generate a mesh by solving a least-squares system on the vertex positions (Sect. 3.2). The generated mesh can then be utilized for various post-processing tasks such as human semantic parameter estimation, free view-point video with projective texturing, further shape refinement [6, 56], or pose refinement [25].

3.2 Shape as a Geometric Model

We utilize the SCAPE model [3], which is a low-dimensional parametric model learned from 3D range scans of different people in different poses that captures correlated deformations due to shape and pose changes simultaneously. Specifically, SCAPE is defined as a set of triangle deformations applied to a reference template 3D mesh. Estimating a new shape requires estimating parameters \(\alpha \) and \(\beta \), which determine the deformations due to pose and intrinsic body shape, respectively. Given these parameters, each of the two edges \(\mathbf {e}_{i1}\) and \(\mathbf {e}_{i2}\) of the \(i^{th}\) triangle of the template mesh (defined as the difference vectors between the vertices of the triangle) is deformed according to the following expression

$$\begin{aligned} \mathbf {e}_{ij}^{'} = \mathbf {R}_{i}(\alpha )\mathbf {S}_i(\beta )\mathbf {Q}_i(\mathbf {R}_{i}(\alpha )) \mathbf {e}_{ij}, \end{aligned}$$
(1)

with \(j \in \{ 1,2 \}\). The matrices \(\mathbf {R}_i(\alpha )\) correspond to joint rotations, and \(\mathbf {Q}_i(\mathbf {R}_i(\alpha ))\) to the pose-induced non-rigid deformations, e.g. muscle bulging. \(\mathbf {S}_i(\beta )\) are matrices modeling shape variation as a function of the shape parameters \(\beta \). The body shape deformation space is learned by applying PCA to a set of meshes of different people in full correspondence and in the same pose, with transformations written as \(\mathbf {s}(\beta ) = \mathbf {U}\beta + \mu \), where \(\mathbf {s}(\beta )\) is obtained by stacking all transformations \(\mathbf {S}_i(\beta )\) for all triangles, \(\mathbf {U}\) is a matrix with orthonormal columns, and \(\mu \) is the mean of the triangle transformations over all meshes (please refer to Anguelov et al. [3] for further details). We therefore obtain the model by computing per-triangle deformations for each mesh of the dataset from a template mesh, which is the mean of all the meshes in the dataset (Fig. 2 left), and then applying PCA in order to extract the components capturing the largest deformation variations. We chose to use 20 components (\(\beta \in R^{20}\)).
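As an illustration, the following is a minimal sketch of how such a shape space could be learned with off-the-shelf PCA, assuming the per-triangle deformation matrices relative to the template have already been computed; function and variable names are ours, not part of the SCAPE reference implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def learn_shape_space(deformations, n_components=20):
    """deformations: (num_meshes, num_triangles, 3, 3) per-triangle deformations
    of each training mesh relative to the template (assumed precomputed)."""
    n_meshes = deformations.shape[0]
    S = deformations.reshape(n_meshes, -1)       # stack all S_i into one vector s per mesh
    pca = PCA(n_components=n_components)
    betas = pca.fit_transform(S)                 # shape parameters of the training meshes
    U, mu = pca.components_.T, pca.mean_         # s(beta) = U @ beta + mu
    return pca, betas, U, mu

def shape_matrices_from_beta(pca, beta, n_triangles):
    """Reconstruct the per-triangle shape matrices S_i(beta) from the parameters."""
    s = pca.inverse_transform(beta[None, :])[0]
    return s.reshape(n_triangles, 3, 3)
```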

Fig. 2. 6 meshes from our database. The leftmost one is the mean mesh in the rest pose. The others are from different people in various poses.

We would like to estimate the shape parameters \(\beta \) regardless of the pose. We make the common assumption that the body shape does not significantly change over the range of poses we consider. Hence, we ignore pose-dependent shape changes given by \(\mathbf {Q}_i(\mathbf {R}_i(\alpha ))\). Decoupling pose and shape changes allows us to adopt a fast and efficient method from the graphics community known as Linear Blend Skinning (LBS) [31] for pose changes, similar to previous works [25, 38]. Starting from a rest pose shape with vertices \(\mathbf {v}_{1},...,\mathbf {v}_{n} \in \mathbf {R}^{4}\) in homogeneous coordinates, LBS computes the new position of each vertex as a weighted combination of the bone transformation matrices \(\mathbf {T}_{1},...,\mathbf {T}_{m}\) in a skeleton controlling the mesh, with skinning weights \(w_{i,1},...,w_{i,m} \in \mathbf {R}\) for each vertex \(\mathbf {v}_{i}\), as given by the following formula:

$$\begin{aligned} \mathbf {v}_{i}^{'} = \sum _{j=1}^{m} w_{i,j}\mathbf {T}_{j} \mathbf {v}_{i} = \left( \sum _{j=1}^{m}w_{i,j}\mathbf {T}_{j} \right) \mathbf {v}_{i} \end{aligned}$$
(2)

In our model, the skinning weights are computed for a skeleton of 17 body parts (1 for the head, 2 for the torso, 2 for the hips and 3 for each of the lower and upper limbs) for the mean shape mesh, using the heat diffusion method [5]. Note that \(w_{i,j} \ge 0\) and \(w_{i,1}+\cdots +w_{i,m} = 1\).
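A minimal sketch of Eq. (2), assuming the rest-pose vertices, skinning weights and bone transformations are already available as NumPy arrays (names are illustrative):

```python
import numpy as np

def linear_blend_skinning(vertices_h, weights, bone_transforms):
    """Eq. (2): v_i' = (sum_j w_ij T_j) v_i.

    vertices_h      : (n, 4) rest-pose vertices in homogeneous coordinates
    weights         : (n, m) skinning weights, non-negative, rows summing to 1
    bone_transforms : (m, 4, 4) transformation matrix of each of the m bones
    """
    blended = np.einsum('nm,mij->nij', weights, bone_transforms)  # per-vertex blended transform
    return np.einsum('nij,nj->ni', blended, vertices_h)           # transformed vertices
```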

3.3 Feature Extraction

We extract novel features from the scaled silhouettes as the input to our learning method. These features are designed to capture local and global information about the silhouette shape, and to be robust to pose and slight view changes. For each point in the silhouette, two feature values are calculated, namely the (weighted) normal depth and the curvature. In order to extract these, we first compute the 2D point normal for every point in the silhouette, and then smooth all normals with a circular filter of radius 7 pixels. As different people have different silhouette lengths, we sample 1704 equidistant points from each silhouette, starting from its topmost pixel. The sample size is set according to the smallest silhouette length over all our training data. Our feature vector per silhouette thus consists of 3408 real-valued numbers.

The normal depth is computed as follows. For any point from the sampled set, we send several rays starting from the point itself and oriented along the opposite direction of its normal, until they intersect the inner silhouette boundary. The lengths of the ray segments are defined as the normal depths as illustrated in Fig. 3 (left). The normals are represented in green and the ray segments in red for two different points in the silhouette. We allow an angle deviation of 50\(^\circ \) from the silhouette normal axis. The feature for a point is defined as the weighted average of all normal depths falling within one std. dev. from the median of all the depths, with weights defined as the inverse of the angle between the rays and the normal axis.

Fig. 3. (left) Normal depth computation in 2 different points. The arrows are the silhouette normals. The normal depth is computed as the weighted mean of the lengths of the red lines. (middle) 3D measurements on the meshes used for validation. (right) Noisy silhouette. (Colour figure online)
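A simplified sketch of the normal depth computation for a single sampled point, assuming a binary silhouette mask and an already smoothed outward 2D normal; the pixel-stepping ray march and the exact weighting are simplifications of what is described above.

```python
import numpy as np

def normal_depth(mask, point, normal, max_dev_deg=50, n_rays=11):
    """mask: (H, W) boolean silhouette, point: (x, y), normal: unit outward normal."""
    base = np.arctan2(-normal[1], -normal[0])            # direction opposite to the normal
    offsets = np.deg2rad(np.linspace(-max_dev_deg, max_dev_deg, n_rays))
    depths, weights = [], []
    for off in offsets:
        dx, dy = np.cos(base + off), np.sin(base + off)
        x, y, d = float(point[0]), float(point[1]), 0.0
        while True:                                       # march until the ray exits the silhouette
            x, y, d = x + dx, y + dy, d + 1.0
            xi, yi = int(round(x)), int(round(y))
            inside = 0 <= yi < mask.shape[0] and 0 <= xi < mask.shape[1]
            if not inside or not mask[yi, xi]:
                break
        depths.append(d)
        weights.append(1.0 / (1.0 + abs(off)))            # inverse angular deviation (offset avoids /0)
    depths, weights = np.array(depths), np.array(weights)
    keep = np.abs(depths - np.median(depths)) <= depths.std()  # within one std. dev. of the median
    return np.average(depths[keep], weights=weights[keep])
```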

The normal depth is a feature inspired by 3D geodesic shape descriptors [44, 47]. It differs from the Inner-Distance 2D descriptor [32], which is used to classify different object types but is sensitive to noise, and from the spectral features utilized in [29] for a shape retrieval task. The main ideas behind our feature are: (a) for the same individual in different poses, under mild self-occlusions, the features look very similar up to small local shifts; (b) each point feature serves as a robust body measurement, correlated with the breadth of the person in various parts of the body, which is analogous to estimating the body circumference at each vertex of the real body mesh; and (c) the feature is robust to silhouette noise due to the median and averaging steps. The measure might still differ in some parts of the silhouette (e.g. at the elbow) for the same person in different poses. In order to alleviate this limitation, we apply smoothing on small neighborhoods of the silhouette. The curvature, on the other hand, is estimated as the local variance of the normals. Despite being a local feature, it provides a measure of roundness, especially around the hips, waist, belly and chest, which helps in discriminating between various shapes.
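The curvature feature admits an equally small sketch; the window size below is an assumption, as the neighborhood used for the local variance is not specified above.

```python
import numpy as np

def curvature_feature(normals, i, half_window=5):
    """Local variance of the smoothed 2D normals around sample i (sketch).
    normals: (N, 2) unit normals along the sampled silhouette contour."""
    window = normals[max(0, i - half_window): i + half_window + 1]
    return float(window.var(axis=0).sum())
```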

In Sect. 4 we illustrate that the combination of the normal depth, which captures global information about the silhouette, and the curvature, which encodes local details, leads to estimators that are robust to limited self-occlusions and discriminative enough to describe the silhouette and reconstruct the corresponding shape.

3.4 View Direction Classification

To increase robustness with respect to view changes, we train view-specific Regression Forests for 36 viewing directions around the body. In order to discriminate between the views, we train a Random Forest Classifier on the 3408 features extracted (Sect. 3.3) from 100,000 silhouettes of people in multiple poses, shapes and views, with the views numbered 1 to 36 as labels. We achieve a high accuracy of 99 % if we train and test on neutral and selfie-like poses. The accuracy decreases to 85.7 % if more involved poses (e.g. walking, running) are added. However, by investigating class prediction probabilities, we observed that misclassified silhouettes are assigned only to views adjacent to the correct one. As shown in Sect. 4, Table 2, a 10\(^\circ \) view difference incurs a low reconstruction error when the features are projected into CCA bases.
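A minimal sketch of the view classifier with scikit-learn, assuming the feature matrix has already been computed; the number of trees is an assumption, as we do not report the classifier's hyperparameters here.

```python
from sklearn.ensemble import RandomForestClassifier

def train_view_classifier(X_train, y_train, n_trees=50):
    """X_train: (n, 3408) silhouette features; y_train: view labels in {1, ..., 36}.
    The number of trees is an assumption."""
    clf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    clf.fit(X_train, y_train)
    return clf

def predict_view(clf, features):
    """Return the predicted view and the per-view class probabilities."""
    features = features.reshape(1, -1)
    return clf.predict(features)[0], clf.predict_proba(features)[0]
```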

3.5 Learning Shape Parameters

We pose shape parameter estimation as a regression task. Given the silhouette features, using supervised learning, we would like to estimate the shape parameters such that the reconstructed shape best explains the silhouette. To make the features more discriminative, we propose to correlate features extracted from silhouettes viewed from different directions. More specifically, we apply Canonical Correlation Analysis (CCA) [24] over features extracted from a pair of silhouettes from two camera views.

At training time, the views are selected such that they capture complementary information. While the first one is the desired view from which we want to estimate the shape (one of the 36 views), the second one is chosen to be as orthogonal as possible to the first (e.g. front and side view). Because the human body is symmetric, the complementary view to a desired one is always searched within the 0 to 90\(^\circ \) angle range relative to that view. In practice, we round the complementary view to the closest extreme (i.e. front or side view) to ease the offline computations, as sketched below.
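One possible way to implement this rounding, assuming views are indexed by their angle in degrees (a simplification of our 36 discrete views):

```python
def complementary_view(view_deg):
    """Pick the complementary (near-orthogonal) view, rounded to front (0) or side (90)."""
    v = view_deg % 180
    if v > 90:                  # fold into [0, 90] using left/right body symmetry
        v = 180 - v
    return 90 if v < 45 else 0  # choose the more orthogonal of the two extremes
```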

We first apply PCA to reduce the dimensionality of the extracted features from 3408 to 300 in each view. We then stack the PCA-projected features for all mesh silhouettes from the first and second views into the columns of the matrices \(\mathbf {X}_1\) and \(\mathbf {X}_2\), respectively. CCA then finds basis vector pairs \(\mathbf {b}_1\) and \(\mathbf {b}_2\) such that the correlations between the projections of the variables onto these vectors are mutually maximized, by solving:

$$\begin{aligned} \mathop {\text {arg~max}}\limits _{\mathbf {b}_{1},\mathbf {b}_{2}\in {R}^{N}} corr(\mathbf {b}_{1}^{T}\mathbf {X}_{1},\mathbf {b}_{2}^{T}\mathbf {X}_{2} ), \end{aligned}$$
(3)

where \(N = 300\). This results in a coordinate-free mutual basis unaffected by rotation, translation or global scaling of the features. The features projected onto this basis thus capture mutual information coming from both views. Subsequent basis vector pairs are computed similarly, under the constraint that the new projected features are orthogonal to the existing ones. We use 200 basis pairs, with the CCA projections covering 99 % of the energy.
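A minimal sketch of this step with scikit-learn, assuming paired raw features from the two views for the same training meshes are available; note that scikit-learn expects samples in rows rather than columns.

```python
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

def fit_view_pair_projection(F1, F2, n_pca=300, n_cca=200):
    """F1, F2: (n_meshes, 3408) raw features of the same meshes seen from the
    desired view and from its (near-)orthogonal complementary view."""
    pca1, pca2 = PCA(n_components=n_pca), PCA(n_components=n_pca)
    X1, X2 = pca1.fit_transform(F1), pca2.fit_transform(F2)
    # Eq. (3): find paired bases b_1, b_2 maximizing corr(b_1^T X_1, b_2^T X_2).
    cca = CCA(n_components=n_cca, max_iter=1000)
    cca.fit(X1, X2)
    Z1, Z2 = cca.transform(X1, X2)      # projected training features per view
    return pca1, cca, Z1, Z2
```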

The final training is done on the 200 projected features extracted from one view, which is one of the 36 views we consider. These projected features are input to a Random Forest Regressor [8] with 4 trees and a maximum depth of 20. The labels for this regressor are the 20-dimensional shape parameter vectors \(\beta \). Each component of \(\beta \) is weighted by the corresponding eigenvalue of the covariance matrix from the shape deformation space computation (Sect. 3.2), normalized to 1, in order to emphasize large-scale changes in 3D body shape. At test time, the raw features extracted from a single given silhouette are first classified into a view. These are then projected with the PCA and CCA matrices obtained for that view to produce a 200-dimensional vector. The projected features are finally fed into the corresponding Random Forest Regressor to obtain the desired shape parameters \(\beta \).
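Building on the sketch above, a view-specific regressor and the test-time path could look as follows; the weighting and its inversion reflect our reading of the label weighting, and the dictionary names (pca_by_view, etc.) are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_view_regressor(Z1, betas, eig_weights):
    """Z1: (n_meshes, 200) CCA-projected features of the desired view;
    betas: (n_meshes, 20) shape parameters; eig_weights: (20,) normalized
    eigenvalue weights emphasizing large-scale shape variations (assumed given)."""
    forest = RandomForestRegressor(n_estimators=4, max_depth=20)
    forest.fit(Z1, betas * eig_weights)
    return forest

def estimate_shape(features, view_clf, pca_by_view, cca_by_view, forest_by_view, eig_weights):
    """Test time: classify the view, project with that view's PCA and CCA, regress beta."""
    view = view_clf.predict(features.reshape(1, -1))[0]
    x = pca_by_view[view].transform(features.reshape(1, -1))
    z = cca_by_view[view].transform(x)                  # projection onto the CCA basis
    beta_weighted = forest_by_view[view].predict(z)[0]
    return beta_weighted / eig_weights                  # undo the label weighting
```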

4 Validation and Results

Previous shape-from-silhouette methods lack extensive evaluation. Xi et al. [52] demonstrate results on two real images of people and 24 subjects in synthetic settings, Sigal et al. [46] validate on two measurements and two subjects in monocular settings, and Balan et al. [4] report silhouette errors for a few individuals in a sequence and height measurement for a single individual. To the best of our knowledge, only Boisvert et al. [6] perform a more extensive validation, for 220 synthetic humans consisting of scans from the CAESAR database [39], and four real individuals’ front and side images. We present the largest validation experiment with 1500 synthetic body meshes as well as real individuals.

Data Generation. In order to learn a general model, we merge two large datasets [38, 54] consisting of 3D models extracted from the commercially available CAESAR dataset. We select 2900 meshes from the combined dataset for learning the shape model, leaving out around 1500 meshes for testing and experiments. In order to synthesize more training meshes, we sample from the 20-dimensional multivariate normal distribution spanned by the PCA space (Sect. 3.2), such that for a random sample \(\beta = [\beta _{1}, \beta _{2},..., \beta _{20}]\), it holds that \(\beta \sim \mathcal {N}(\mu ,\Sigma )\) with \(\mu \) being the 20-dimensional mean vector and \(\Sigma \) the \(20 \times 20\) covariance matrix of the parameters. To synthesize meshes in different poses, we gather a set of animations comprising various poses (e.g. selfie, walking, running, etc.). After transferring a generated pose to the template mesh using LBS, we compute the resulting per-triangle deformations \(\mathbf {R}_i\). For a given mesh with parameters \(\beta \), the final pose is then given by \(\mathbf {e}'_{ij} = \mathbf {R}_i \mathbf {S}_i(\beta ) \mathbf {e}_{ij}\), where \(\mathbf {e}_{ij}\) are the edges of the template mesh (Sect. 3.2).
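Given the deformed edges \(\mathbf {e}'_{ij}\), the mesh vertices are recovered by the least-squares solve mentioned in Sect. 3.1; a minimal sketch with SciPy, assuming the triangle indices and deformed edge vectors are given (anchoring vertex 0 is our choice for fixing the translational ambiguity).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def reconstruct_vertices(faces, deformed_edges, n_vertices):
    """faces: (T, 3) vertex indices; deformed_edges: (T, 2, 3) vectors e'_i1, e'_i2."""
    rows, cols, vals, rhs = [], [], [], []
    r = 0
    for t, (a, b, c) in enumerate(faces):
        for j, v in enumerate((b, c)):            # e_1 = v_b - v_a, e_2 = v_c - v_a
            rows += [r, r]; cols += [v, a]; vals += [1.0, -1.0]
            rhs.append(deformed_edges[t, j]); r += 1
    rows.append(r); cols.append(0); vals.append(1.0)   # anchor vertex 0 at the origin
    rhs.append(np.zeros(3)); r += 1
    A = sp.csr_matrix((vals, (rows, cols)), shape=(r, n_vertices))
    B = np.vstack(rhs)
    # Solve each coordinate independently in the least-squares sense.
    return np.column_stack([spla.lsqr(A, B[:, k])[0] for k in range(3)])
```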

Fig. 4. Visual results for predictions on 4 test meshes. From left to right: predicted mesh, ground truth mesh, the two meshes frontally overlapping, the two meshes from the side view, silhouette from the predicted mesh, input silhouette.

Fig. 5. Visual results for predictions on 3 females. From left to right: the two input images in a rest and selfie pose, the estimated mesh - same estimation is obtained for both poses, the two silhouettes from which features are extracted for each pose, the silhouette of the estimated mesh.

As the training set, we randomly generate 100,000 samples from the multivariate distribution over the \(\beta \) parameters, and restrict them to fall within \(\pm 3\) standard deviations for each dimension of the PCA-projected parameters, to avoid generating unrealistic human shapes. We project the generated meshes into each of the 36 camera viewpoints around the mesh (Sect. 3.4). The silhouette is computed by projecting all the mesh edges for which the two adjacent triangles have normals pointing in opposite directions. The silhouettes are then uniformly scaled such that the height of the bounding box is equal to 528 pixels, and the width to 384 pixels. For testing, we evaluate our method on the meshes left out from the training dataset, as well as on real images.
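A minimal sketch of this sampling, assuming the mean \(\mu \) and covariance \(\Sigma \) from the PCA fit are available; out-of-range samples are simply redrawn here, which is one way of implementing the restriction.

```python
import numpy as np

def sample_betas(mu, Sigma, n=100000, clip=3.0, seed=0):
    """Draw beta ~ N(mu, Sigma) restricted to +/- clip std. dev. per dimension."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(np.diag(Sigma))
    betas = rng.multivariate_normal(mu, Sigma, size=n)
    bad = np.any(np.abs(betas - mu) > clip * std, axis=1)
    while bad.any():                                  # redraw samples outside the allowed range
        betas[bad] = rng.multivariate_normal(mu, Sigma, size=int(bad.sum()))
        bad = np.any(np.abs(betas - mu) > clip * std, axis=1)
    return betas
```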

Table 1. Comparisons to state-of-the-art methods, variations of our method (RF, CCA-RF-1, CCA-RF-2) and ground truth, via various measurements. The measurements are illustrated in Fig. 3 (middle). Errors are represented as Mean ± Std. Dev and are expressed in millimeters. Note that we operate under a significantly more general setting than the state-of-the-art methods, please refer to the text.

Quantitative Experiments. We distinguish two test datasets, D1 and D2. D1 consists of 1500 meshes used neither to learn the parametric shape model nor to train the regression forests (RF), and D2 of 1000 meshes used to learn the parametric model but not to train the RF. These meshes consist of \(50\,\%\) males and \(50\,\%\) females, and are in roughly the same rest pose. In order to properly quantify our method, similar to Boisvert et al. [6], we perform 16 three-dimensional measurements on the meshes, which are commonly used in garment fitting, as illustrated in Fig. 3 (middle). For the measurements represented with straight lines, we compute the Euclidean distance between the two extreme vertices. The ellipses represent circumferences and are measured on the body surface. For each of the 16 measurements, we compute the difference between the value from the ground truth mesh and that from the estimated mesh. We report the mean error and the standard deviation for each of the measurements in Table 1. We name our main method CCA-RF (CCA applied to the features before passing them to the random forest), with CCA-RF-1 and CCA-RF-2 denoting testing on D1 and D2, respectively. Similarly, RF denotes the method trained on raw features and tested on D1. The last table column provides the ground truth (GT) mean errors for D1, computed between the original test meshes and their reconstructions obtained by projecting them into the learned PCA space. This provides a lower limit on the errors obtainable with our 20-parameter shape model.
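The per-measurement errors could be computed as in the following sketch, assuming vertex index pairs for the straight-line measurements and ordered vertex loops for the circumferences (the indices themselves are mesh-specific and not listed here).

```python
import numpy as np

def straight_measure(verts, i, j):
    """Euclidean distance between the two extreme vertices of a measurement."""
    return np.linalg.norm(verts[i] - verts[j])

def circumference(verts, loop):
    """Length of a closed vertex loop on the body surface (ellipse measurements)."""
    pts = verts[loop]
    return np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum()

def measurement_errors(gt_verts, est_verts, lines, loops):
    """Absolute per-measurement differences between ground truth and estimate."""
    errs = [abs(straight_measure(gt_verts, i, j) - straight_measure(est_verts, i, j))
            for i, j in lines]
    errs += [abs(circumference(gt_verts, l) - circumference(est_verts, l)) for l in loops]
    return np.array(errs)
```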

Before analyzing the results, it is crucial to highlight the differences between the settings and goals of the methods we compare to. Boisvert et al. [6] employ a setting where the pose is fixed to a rest pose and the distance from the camera is also fixed. The shape estimation method is based on utilizing silhouettes from two different views (front and side), with the application of garment fitting in mind. The same setting is considered for the other two methods mentioned above  [13, 52]. In contrast, we train and test for a more general setting, where we have a single silhouette as the input at test time, the pose can change, and no assumptions on the distance from the camera are made. Furthermore, our tests involve a significantly larger dataset with high variations.

Even though our method operates under a significantly more general setting than the previous works [6, 13, 52], with a single silhouette input and no distance information, it outperforms the non-linear and linear mapping methods. The mean absolute error for all the models is 19.4 mm for CCA-RF-1 and 16.18 mm for CCA-RF-2. The errors are very close to those of GT, illustrating the accuracy of our technique. Note that some errors for CCA-RF-2 are smaller than those of the GT, due to the different training, as explained above. The higher error for D1 is due to body shapes that cannot be represented with the parametric model learned from the rest of the shapes. The error is higher for the overall height, due to the fixed scale in the training and testing silhouettes that we use. It is important to note the differences in errors between RF and CCA-RF-1. There is an overall decrease in error when CCA is utilized, which shows that the projection onto the CCA bases significantly improves prediction. Additionally, we evaluate the performance of our method when the input comes from a less favorable view, the side view, achieving an error of 22.45 mm, which is very close to the one from only the frontal view. For completeness, we also compare to Helten et al. [23], who utilize an RGB-D camera to capture the body shapes, and a full per-vertex RMSE map to measure the differences. Using two depth maps, fitting to the pose, and testing on only 6 individuals, they report a mean error of 10.1 mm, while we have a mean error of 19.19 mm on 1500 meshes.

Poses, Views and Noise. We investigated accuracy in the presence of silhouette noise, various poses, and different or multiple views. We run the experiments with the data setup D1, explained above. For each experiment, we show the mean and standard deviation either of the accuracy gain or of the errors over all the body measurements in Table 2.

The first three columns show the accuracy gain of applying CCA-RF to the front view (F), the side view (S), or both views concatenated (FS), as compared to RF. A larger gain is obtained for the side view than for the front view, due to the additional information injected from the frontal view (the most representative one) into the projected space. An even larger gain is obtained if both views are utilized for training and testing. This is very important, as it shows that having more views improves the predictor. In fact, we observed that utilizing the same amount (100,000) of training data, but training and testing on two views with the raw features, degrades the result compared to just one view. This is alleviated by the CCA projection, which improves the results as single-view noise in the data is removed.

The fourth column (VE) displays the errors obtained by testing on features extracted from a view rotated \(10^\circ \) from the frontal view, for a CCA-RF trained on the frontal view. The column (VG) displays the gain of CCA-RF over RF for the same scenario. The CCA-RF is again more accurate; however, the error for both is generally low, implying that a camera-view classification error of 10\(^\circ \) can be tolerated by our system. (N) demonstrates the error due to random noise added to the silhouettes, as in Fig. 3 (right), showing robustness to noise to a certain extent. (P12) shows the error induced by training only on a rest pose and testing on 12 different poses as in Fig. 2, as compared to testing on the same meshes in a rest pose. (P1) reports the same measurement, but training on 12 poses and testing on a different, unseen one, demonstrating robustness to pose changes under minimal self-occlusions. The last three columns report similar measurements while gradually increasing the articulation of the poses, with (W) consisting of poses from a walking sequence, (R) from a running sequence (supplementary [1]), and (PWR) combining all the poses we have. The error increases in the latter case, especially due to the introduction of poses with more self-occlusions. However, when trained on individual sequences, the errors are lower, implying that for an application where a certain activity is known, one could train specialized regressors, especially given the very fast training in the low-dimensional spaces.

Table 2. Columns 1–3 show the accuracy gain of applying CCA for the Frontal, Side, and combined Frontal-Side views, over raw features. (VE) shows the error due to a 10\(^\circ \) view change and (VG) the gain of applying CCA. (N) is the error due to silhouette noise. (P12) shows the error of testing on 12 poses different from the training one, and the rest (Columns 8–11) show the errors while gradually adding poses more difficult than the training ones. Mean and Std. Deviation are computed over all the body measurements.

Algorithm Speed. The method is significantly faster than previous works, allowing for interactive applications. The method of Boisvert et al. [6] needs 6 s for body shape regression, 30 s for the MAP estimation, and 3 min for the silhouette-based similarity optimization, with 6 s for their implementation of sGPLVM [13] (on an Intel Core i7 CPU at 3 GHz with a single-threaded implementation). We, on the other hand, reach 0.3 s using a single-threaded implementation on an Intel Core i7 CPU at 3.4 GHz (0.045 s for feature computation, 0.25 s for mesh computation, and 0.005 s for random forest regression), with further speed-up opportunities since the feature computation and the mesh vertex computation can be highly parallelized.

Qualitative Results. In Fig. 4, we show example results from our tests. In each row, the predicted mesh is first shown along with the ground truth test mesh. Then, their overlap is illustrated. This is followed by the side views, the silhouette of the estimated mesh, and the input silhouette. Note that the input silhouettes are in different poses, but we show the estimated meshes in rest poses for easy comparison. Our results are visually very close to the ground truth shapes even under such pose changes.

Finally, in Fig. 5 we show an experiment where real pictures of three females are taken in a rest and a selfie-like pose, along with the estimated meshes. It is important to note that despite the pose change, the retrieved mesh for each person is the same. Another important observation is that even though the input is scaled to the same size, the estimated parameters yield statistically plausible heights, which turned out to be sufficient to obtain an ordering of the estimated meshes by relative height. We believe that this is due to the statistical shape model, where semantic parameters like height and weight are correlated in the PCA parameter space. To the best of our knowledge, no previous work can resolve this task. For example, in the work by Sigal et al. [46], the mesh needs to be scaled if no camera calibration is provided.

5 Discussion and Conclusions

In this paper, we presented a novel technique that estimates 3D human body shape from a single silhouette. It allows different views, poses with mild occlusions, and various body shapes to be estimated. We extensively evaluated our technique on thousands of human bodies, by utilizing one of the biggest databases available to the community.

In the scope of this paper, we focused on shape extraction from a single silhouette because of its various applications, such as selfies or settings with limited video footage. However, this is an inherently ill-posed problem. Further views can be incorporated to obtain more accurate reconstructions, similar to the methods we compare to. This would lead to better estimates, especially in the areas around the belly and chest, and hence decrease the errors of the elliptical body measurements.

The accuracy of our method is tied to silhouette extraction. For the difficult cases of dynamic backgrounds or very loose clothes, the resulting large-scale silhouette deformations would skew our results. This could be tackled by fusing results over multiple frames. Unlike [13] though, our results always remain in the space of plausible human bodies. For small-scale deformations (Fig. 3 right), we show in Table 2 (N) that our results stay robust.

We assume that the silhouettes come in poses with limited partial occlusion. Under this assumption, we showed robustness: the same mesh estimate is obtained from different poses (e.g. Fig. 5). However, under more pronounced occlusions, our results start degrading (Table 2 (PWR)), which could be alleviated by increasing the number of training poses and utilizing deeper learning.

Although we aimed at precise measurements for the evaluation, errors due to discretization are inevitable; hence, a standardized procedure on a standard mesh dataset is needed as a benchmark. We believe that this work, along with that of Boisvert et al. [6], is an important step in this direction.

Since our system is designed for a general setting, we apply a fixed scale to the silhouette, losing height information. We showed fairly good performance in estimating relative height, and demonstrate better absolute height estimation if camera calibration is incorporated (supplementary [1]).

Our fast system, running in minutes for training and milliseconds for execution on single-core CPUs, while being memory-lightweight due to the low feature dimensionality, could be integrated into smartphones, allowing body shapes to be reconstructed with one click of a button. At the same time, it can be used for 3D sports analysis, where estimating the 3D shape of a player seen from a sparse set of cameras can improve novel-view projections.

Finally, we showed how CCA, which captures relations in an unsupervised linear way, can be used to correlate different views of the data to improve the prediction power and speed of the algorithm. We believe that capturing non-linear relations with Kernel CCA or deep architectures should lead to even better results. Our method illustrates the utility of CCA for other vision applications where two or more views describing the same object or event exist, such as multi-view pose estimation, video-to-text matching, or shape from various sources of information.