Recursive non-rigid structure from motion with online learned shape prior

doi:10.1016/j.cviu.2013.03.005

Computer Vision and Image Understanding

Volume 117, Issue 10, October 2013, Pages 1287-1298

https://doi.org/10.1016/j.cviu.2013.03.005

Highlights

• A novel approach is proposed for recursive non-rigid structure from motion.
• An adaptive algorithm is employed for online update of shape prior.
• Data storage requirements are reduced by removing original data once the shape model is recursively updated.
• The algorithm is able to handle the measurements with missing data.
• The method is not overly sensitive with respect to missing data, missing frames and measurement noise.

Abstract

Most existing approaches to structure from motion for deformable objects focus on non-incremental solutions using batch algorithms, in which all data is collected before shape and motion reconstruction takes place. This methodology is inherently unsuitable for applications that require real-time learning. Ideally, an online system would incrementally learn and build accurate shapes using the current measurement data and past reconstructed shapes. Estimating 3D structure and camera position online, relying only on the measurements available up to that moment, remains a challenging problem.

In this paper, a novel approach is proposed for recursive recovery of non-rigid structures from image sequences captured by a single camera. The main novelty of the proposed method is an adaptive algorithm for constructing shape constraints that impose stability on the online reconstructed shapes. The proposed, adaptively learned constraints have two aspects: constraints imposed on the basis shapes, the basic “building blocks” from which shapes are reconstructed; and constraints imposed on the mixing coefficients in the form of their probability distribution. Constraints are updated when the current model no longer adequately represents new shapes. This is achieved by means of Incremental Principal Component Analysis (IPCA). The proposed technique can also handle missing data. Results are presented for motion-capture-based data of an articulated face and simple human full-body movement.

Introduction

The Structure from Motion (SfM) problem is to jointly reconstruct 3D deformable shapes and estimate the corresponding camera external parameters from a set of images, assuming camera internal parameters to be unknown. This technique has become an active area of research in computer vision, with applications in many different domains including augmented reality, autonomous navigation and medical imaging.

A direct solution for recovering both the motion and structure of an object is the classical algorithm of point-based SfM with factorization. Tomasi and Kanade [31] first proposed a factorization algorithm based on the Singular Value Decomposition (SVD), which was used for reconstruction of a rigid object under an orthographic camera model. The algorithm factorizes the measurement matrix into shape and rotation matrices under a rank constraint. Techniques for rigid shape recovery via point-based SfM have since matured [11], [19], [30]. Subsequent work has focused on the factorization approach applied to multiple rigid objects [15] and articulated rigid objects [33]. In contrast to rigid objects, many objects in real environments deform over time, e.g. the human body due to movement [2], [14], the face due to articulation [4], [35] and other objects of interest [10]. This makes the problem difficult to solve because the shape of the object varies from frame to frame. Research has therefore expanded into Non-Rigid Structure from Motion (NRSfM).
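The rank-constrained factorization described above can be illustrated with a minimal NumPy sketch. It is not the implementation from [31]: it only performs the affine rank-3 split via SVD on synthetic data and omits the metric upgrade that enforces the rotation constraints. The function name `factorize_rigid` and the synthetic scene are assumptions made for illustration.

```python
import numpy as np

def factorize_rigid(W):
    """Affine rank-3 factorization of a 2F x P measurement matrix (hypothetical helper).

    Returns motion M (2F x 3) and shape S (3 x P), defined only up to an
    invertible 3x3 ambiguity; no orthonormality constraints are enforced here.
    """
    # Remove the per-frame translation by centring each row on the feature centroid.
    W_centered = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    # Rank constraint: keep only the three dominant components.
    sqrt_s = np.sqrt(s[:3])
    M = U[:, :3] * sqrt_s
    S = sqrt_s[:, None] * Vt[:3]
    return M, S

# Synthetic rigid scene observed by random orthographic cameras (assumed data).
rng = np.random.default_rng(0)
F, P = 12, 20
S_true = rng.standard_normal((3, P))
rows = []
for _ in range(F):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R = Q[:2]                           # first two rows of an orthogonal matrix
    t = rng.standard_normal((2, 1))     # 2D translation
    rows.append(R @ S_true + t)
W = np.vstack(rows)

M, S = factorize_rigid(W)
W_centered = W - W.mean(axis=1, keepdims=True)
residual = np.linalg.norm(W_centered - M @ S)
print(residual)
```

On noise-free data the centred measurement matrix has rank three, so the product of the two factors reproduces it up to floating-point error.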

Most factorization-based SfM techniques begin with the assumption of an affine camera model, approximating the real projection with either weak perspective or paraperspective viewing conditions. NRSfM with a perspective camera can be seen as an extension of the classical reconstruction under orthographic projection. Perspective reconstruction has been successfully applied where the object model can be assumed rigid: Sturm and Triggs [27] described a non-iterative factorization method for uncalibrated cameras. Han and Kanade [15] proposed an alternative method using a bilinear projective factorization algorithm; this iteratively improves the depth information, eliminating the need to calculate fundamental matrices. An investigation of different camera models is presented in [16]. Using the full perspective camera model can indeed help obtain a correct 3D reconstruction of the object, but too many unknown variables lead to an under-constrained problem. However, the perspective projection model is unnecessary in cases where the object is small relative to the distance between the camera centre and the object.

To extend rigid SfM to the recovery of 3D non-rigid objects [1], the seminal work of Bregler et al. [4] first described a low-rank shape model for varying shapes. They factorize the 2D data matrix into shape coefficients, a camera motion matrix and 3D basis shapes using SVD, a method similar to that proposed in [31]. Following this shape model, factorization for articulated NRSfM was proposed in [25], but small inaccuracies in the affine values obtained from the initial affine decomposition greatly affect the subsequent estimation process. Xiao et al. [35] proposed a closed-form solution and demonstrated an ambiguity in the orthonormality constraints: using only orthonormality constraints is insufficient to determine the estimated structures uniquely. They employ the traditional orthonormality constraints but also introduce additional constraints to further determine the shape basis; however, this method does not cope well with noisy data. To overcome this, iterative optimization methods based on bundle adjustment were introduced in [34] as a last step of reconstruction, in order to improve the quality of estimation. Recent approaches have focused on solving problems related to the inherently large number of degrees of freedom, which together with motion degeneracy (very limited camera motion during data acquisition) may eventually result in worthless reconstructions. As a time-varying object usually cannot deform arbitrarily, the object can be represented in either shape space [32] or trajectory space [14], [12] in order to reduce the dimensionality of the problem.
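A small numerical sketch may help make the low-rank shape model concrete. It assumes, for illustration only, that each frame's shape is a linear combination of K basis shapes and that translation has already been removed; the variable names and the synthetic data are not taken from [4]. Under these assumptions the 2F x P measurement matrix has rank at most 3K.

```python
import numpy as np

rng = np.random.default_rng(1)
F, P, K = 30, 40, 2                      # frames, points, basis shapes (assumed sizes)

B = rng.standard_normal((K, 3, P))       # K basis shapes, each 3 x P
L = rng.standard_normal((F, K))          # per-frame mixing coefficients

def random_orthographic_camera(rng):
    """First two rows of a random 3x3 orthogonal matrix (hypothetical helper)."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q[:2]

W = np.zeros((2 * F, P))
for t in range(F):
    # Shape at frame t: weighted sum of the basis shapes.
    S_t = np.tensordot(L[t], B, axes=1)  # 3 x P
    W[2 * t:2 * t + 2] = random_orthographic_camera(rng) @ S_t

rank = np.linalg.matrix_rank(W)
print(rank)
```

For generic coefficients and cameras the computed rank equals 3K, which is the property that SVD-based non-rigid factorization exploits.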

Prior knowledge of shape and motion, obtained using learning techniques, may eliminate the ambiguity and can be used to improve the accuracy of shape and motion recovery. Additional constraints such as prior scene information [16], [36] can be learned in advance. Bartoli et al. [3] introduced a coarse-to-fine low-rank shape model. They ordered the basis shapes, starting from a mean shape and iteratively adding deformation modes. Recently Zhou et al. [38] proposed a method operating in the presence of nonlinear motion and non-Gaussian distributions, using a Markov chain Monte Carlo technique to minimize the residuals of the estimated shapes. An alternative bundle adjustment approach demonstrated by Del Bue introduces prior information on the object shape [5]. This optimization approach may improve performance for both non-rigid and articulated SfM, obtaining reliable 3D reconstructions when an appropriate initial value is provided. In particular, this method overcomes the problems caused by strong degeneracy within the input image sequence. In practice, however, when constrained only by minimization of the 2D re-projection error and a single basis shape, the optimization often converges to a local minimum, because the large number of variables requires a high-quality initial guess.

Although tremendous progress has been made on SfM for both rigid and deformable shapes, the main limitation of most extant works is that they only address off-line (batch) computation. The downside of batch methods is that reconstruction can only start once all measurement data has been collected. To extend batch mode to online (recursive) operation, Morita and Kanade [21] first presented a sequential factorization method, which treats the feature positions as a vector time series and updates only the first three eigenvectors instead of computing a singular value decomposition. Subsequent research on sequential shape and motion recovery includes Mouragnon et al. [22], who demonstrated a generic and incremental method that minimizes an angular error between rays. Similarly, in [9] the authors added a smoothing penalty on the camera trajectory, updating the structure accordingly as new views are added. Solutions for executing SfM in real time can be classified as filter-based [26], [7] or keyframe-based [18] optimization, and both have proven successful. These methods motivate real-time implementations, which nevertheless have so far only dealt with rigid objects or static environments. As yet, only a limited number of works covering online deformable structure recovery have been published. Most recently, Paladini et al. [24] made progress on this problem, proposing a rank-growing system which updates the current shape model when the 2D re-projection error exceeds an expected value. This technique makes online NRSfM more tractable, but whilst the higher number of degrees of freedom may lead to a smaller re-projection error, it can also result in unrealistic reconstructed shapes that are unrepresentative of the true object. Nor does the method address the self-occlusion problem: measurements are assumed to be complete, which is rarely valid.
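The idea of updating only three eigenvectors as frames arrive, mentioned above in connection with [21], can be sketched as follows. This is a simplified illustration in that spirit rather than a reproduction of the published algorithm: it accumulates a P x P matrix from the centred rows of each new frame and refreshes a rank-3 basis with one orthogonal-iteration step per frame. The function name `update_subspace` and the synthetic rigid scene are assumptions.

```python
import numpy as np

def update_subspace(Z, Q, frame):
    """One recursive update (hypothetical helper).

    Z     : P x P accumulated matrix, sum of frame.T @ frame over past frames
    Q     : P x 3 orthonormal basis estimate of the dominant subspace
    frame : 2 x P centred image coordinates of the new frame
    """
    Z = Z + frame.T @ frame
    Q, _ = np.linalg.qr(Z @ Q)   # single orthogonal-iteration step
    return Z, Q

rng = np.random.default_rng(2)
P = 15
S_true = rng.standard_normal((3, P))                 # assumed rigid shape
Z = np.zeros((P, P))
Q, _ = np.linalg.qr(rng.standard_normal((P, 3)))     # random initial basis

frames = []
for _ in range(10):
    R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    frame = R[:2] @ S_true
    frame = frame - frame.mean(axis=1, keepdims=True)  # remove translation per frame
    frames.append(frame)
    Z, Q = update_subspace(Z, Q, frame)

# Check that every observed row lies in the tracked 3D subspace.
W = np.vstack(frames)
residual = np.linalg.norm(W - (W @ Q) @ Q.T)
print(residual)
```

Only the P x P matrix and the P x 3 basis are carried between frames, so the cost per update does not grow with the number of frames processed.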

Section snippets

Paper contributions

In this paper, an incremental approach is proposed to recursively reconstruct 3D deformable structures, such as an articulated face, from 2D video sequences taken by an orthographic camera. The first contribution is an alternative multi-part cost function, formed by superposition of specific constraints on both the basis shapes and their coefficients within the state-of-the-art batch-processing scheme. The advantage of this approach is that the proposed prior knowledge reduces likelihood of a

Problem statement

Given a point in the world coordinate system, denoted as $s_{n} = [x_{n}, y_{n}, z_{n}]^{T}$ and transformed into the mth image coordinate system through rotation $R_{m}$ and translation $t_{m}$, its orthographic projection $x_{mn}$ onto the mth image is given by: $x_{mn} = \begin{bmatrix} u_{mn} \\ v_{mn} \end{bmatrix} = \begin{bmatrix} R_{m} \mid t_{m} \end{bmatrix} \begin{bmatrix} s_{n} \\ 1 \end{bmatrix} = \begin{bmatrix} r_{m1} & r_{m2} & r_{m3} & t_{xm} \\ r_{m4} & r_{m5} & r_{m6} & t_{ym} \end{bmatrix} \begin{bmatrix} x_{n} & y_{n} & z_{n} & 1 \end{bmatrix}^{T}$ where $x_{mn}$ represents the nth 3D point $s_{n}$ projected onto the mth image; the orthographic camera matrix $R_{m}$ encodes only the first two rows of the rotation matrix, with rotation constraint $R_{m} R_{m}^{T} = I$. It is shown in [31] that when $x_{mn}$ are
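The orthographic projection model stated in this section can be written as a short NumPy sketch. The function name `orthographic_project`, the 30-degree rotation about the y-axis and the two sample points are illustrative assumptions, not values from the paper.

```python
import numpy as np

def orthographic_project(R_full, t, S):
    """Project 3D points with an orthographic camera (hypothetical helper).

    R_full : 3 x 3 rotation matrix; only its first two rows are used
    t      : length-2 image translation (t_x, t_y)
    S      : 3 x N matrix whose columns are the 3D points s_n
    Returns a 2 x N matrix whose columns are the image points (u, v).
    """
    R = R_full[:2]
    return R @ S + t[:, None]

theta = np.deg2rad(30.0)
R_full = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t = np.array([0.5, -1.0])
S = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [0.0, 0.0],
])  # two points stored as columns: (1, 0, 0) and (0, 2, 0)

x = orthographic_project(R_full, t, S)
R = R_full[:2]
print(x)
print(R @ R.T)  # 2 x 2 identity: the rotation constraint on the truncated matrix
```

Because depth along the viewing direction is discarded, the second point, which lies on the rotation axis, keeps its coordinates apart from the translation.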

Prior information learning

As mentioned in the previous section, the results obtained without using any prior information about shape and/or trajectory are sensitive to the level of noise present in the data and to the algorithm initialisation. The greater number of degrees of freedom may lead to a smaller re-projection error, but result in unrealistic reconstructed shapes. Appropriate prior shape information can help to improve the accuracy of motion and shape recovery. The key idea in our method is to use a learned shape space

Recursive NRSfM

For the recursive NRSfM, the shape is divided into off-line and online components: $S_{t} = S_{t}^{off} + S_{t}^{on}$. The off-line part is mainly used to indicate the static overall shape and the online part is responsible for representing the dynamic shape changes. The method described in the preceding section was used to estimate the off-line shape $S_{t}^{off}$ with the prior information about shapes and weights probability distribution learned in advance using standard PCA technique on a training database of

NRSfM with missing data

The two algorithms proposed above assume that the measurement matrix W is complete, with all the feature points detected in all the images. This is unlikely to happen in practice as some of the feature points will not be detected in all the images. This could be because of the feature point detection problems or because some parts of the 3D object may not be visible from all the camera positions. This means some of the entries in the measurement matrix W may be unknown. This section describes a
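One common way to represent unknown entries of the measurement matrix W is a binary visibility mask, so that any error measure is evaluated over observed entries only. The sketch below illustrates that bookkeeping under stated assumptions; it is not the missing-data algorithm of this paper, and the function name `masked_reprojection_error` and the toy matrices are hypothetical.

```python
import numpy as np

def masked_reprojection_error(W, mask, W_hat):
    """RMS 2D error computed over observed entries only (hypothetical helper).

    W     : 2F x P measurements; unobserved entries may hold NaN
    mask  : 2F x P boolean array, True where the feature was detected
    W_hat : 2F x P re-projection predicted by the current model
    """
    # np.where discards the NaN differences at unobserved positions.
    diff = np.where(mask, W - W_hat, 0.0)
    return np.sqrt((diff ** 2).sum() / mask.sum())

rng = np.random.default_rng(3)
W_hat = rng.standard_normal((4, 6))   # toy prediction: F = 2 frames, P = 6 points
W = W_hat.copy()                      # perfect measurements where observed

mask = np.ones_like(W, dtype=bool)
mask[0:2, 3] = False                  # point 3 not detected in frame 0 (u and v rows)
W[~mask] = np.nan                     # mark the unknown entries

error = masked_reprojection_error(W, mask, W_hat)
print(error)
```

Masking both the u and v rows of a frame together reflects that a feature point is either detected in an image or not.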

Results and discussion

The experiments to evaluate the proposed methodology were based on batch formulation and sequential recovery of an articulated face and human motion. In the case of reconstruction of objects undergoing only small deformations, the estimated shape can be accurately represented using a model with a relatively small number of degrees of freedom, thereby allowing for linear deformations. To demonstrate the performance of the proposed methods, extensive experimental evaluation has been provided. We

Conclusions and future work

We have presented a new approach to solve the recursive NRSfM problem and have demonstrated the accuracy and robustness of our method on a series of challenging situations. Our method successfully recovers shape and camera motion parameters as new frames arrive; additionally it allows for updates to the model, thus accounting for new shape variations as objects deform over the sequence. We have also developed several extensions to the algorithm for deformable object recovery, which include use

Acknowledgment

The “drink” motion capture data used in this project was obtained from mocap.cs.cmu.edu.

References (38)

J. Aggarwal et al. Nonrigid motion analysis: articulated and elastic motion. Comput. Vis. Image Und. (1998)
A. Del Bue et al. Non-rigid structure from motion using ranklet-based tracking and non-linear optimization. Image Vis. Comput. (2007)
M. Marques et al. Estimating 3D shape from degenerate sequences with missing data. Comput. Vis. Image Und. (2009)
B. Matuszewski et al. Hi4D-ADSIP 3D dynamic facial articulation database. Image Vis. Comput. (2012)
E. Mouragnon et al. Generic and real-time structure from motion using local bundle adjustment. Image Vis. Comput. (2009)
G. Wang et al. Rotation constrained power factorization for structure from motion of nonrigid objects. Pattern Recognit. Lett. (2008)
J. Aggarwal et al. Human motion: modeling and recognition of actions and interactions. Int. Symp. 3D Data Process. (2004)
A. Bartoli et al. Coarse-to-fine low rank structure from motion. Comput. Vis. Pattern Recognit. (2008)
C. Bregler et al. Recovering non-rigid 3D shape from image streams. Comput. Vis. Pattern Recognit. (2000)
A. Del Bue. A factorization approach to structure from motion with shape priors. Comput. Vis. Pattern Recognit. (2008)
E. Eade et al. Monocular SLAM as a graph of coalesced observations. Int. Conf. Comput. Vis. (2007)
A. Eriksson et al. Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm. Comput. Vis. Pattern Recognit. (2010)
M. Farenzena et al. Efficient camera smoothing sequential structure-from-motion using approximate cross-validation. Eur. Conf. Comput. Vis. (2008)
J. Fayad et al. Non-rigid structure from motion using quadratic deformation models. Br. Mach. Vis. Conf. (2009)
J. Fortuna et al. Rigid structure from motion from a blind source separation perspective. Int. J. Comput. Vis. (2010)
P. Gotardo et al. Computing smooth time-trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Trans. Pattern Anal. Mach. Intell. (2011)
P. Gotardo et al. Kernel non-rigid structure from motion. Int. Conf. Comput. Vis. (2011)
P. Gotardo et al. Non-rigid structure from motion with complementary rank-3 spaces. Comput. Vis. Pattern Recognit. (2011)
M. Han et al. Multiple motion scene reconstruction from uncalibrated views. Int. Conf. Comput. Vis. (2001)

