Recursive non-rigid structure from motion with online learned shape prior

https://doi.org/10.1016/j.cviu.2013.03.005

Highlights

  • A novel approach is proposed for recursive non-rigid structure from motion.

  • An adaptive algorithm is employed for online update of shape prior.

  • Data storage requirements are reduced by removing original data once the shape model is recursively updated.

  • The algorithm is able to handle the measurements with missing data.

  • The method is not overly sensitive with respect to missing data, missing frames and measurement noise.

Abstract

Most existing approaches to structure from motion for deformable objects are non-incremental, batch-type algorithms: all data are collected before shape and motion reconstruction takes place. This methodology is inherently unsuitable for applications that require real-time learning. Ideally, an online system should incrementally learn and build accurate shapes using current measurement data and past reconstructed shapes. Estimating 3D structure and camera position online, relying only on the measurements available up to that moment, remains a challenging problem.

In this paper, a novel approach is proposed for recursive recovery of non-rigid structures from image sequences captured by a single camera. The main novelty of the proposed method is an adaptive algorithm for constructing shape constraints that impose stability on the online reconstructed shapes. The proposed, adaptively learned constraints have two aspects: constraints imposed on the basis shapes, the basic “building blocks” from which shapes are reconstructed; and constraints imposed on the mixing coefficients in the form of their probability distribution. Constraints are updated when the current model no longer adequately represents new shapes. This is achieved by means of Incremental Principal Component Analysis (IPCA). The proposed technique is also capable of handling missing data. Results are presented for motion-capture-based data of an articulated face and simple human full-body movement.

Introduction

The Structure from Motion (SfM) problem is to jointly reconstruct 3D deformable shapes and estimate the corresponding camera external parameters from a set of images, assuming camera internal parameters to be unknown. This technique has become an active area of research in computer vision, with applications in many different domains including augmented reality, autonomous navigation and medical imaging.

A direct solution for recovering both the motion and the structure of an object is the classical point-based SfM algorithm with factorization. Tomasi and Kanade [31] first proposed a factorization algorithm based on the Singular Value Decomposition (SVD), which was used to reconstruct a rigid object under an orthographic camera model. The algorithm factorizes the measurement matrix into shape and rotation matrices under a rank constraint. Techniques for rigid shape recovery via point-based SfM have matured over the following decades [11], [19], [30]. Subsequent work has applied the factorization approach to multiple rigid objects [15] and articulated rigid objects [33]. In real environments, however, many objects deform over time, e.g. the human body due to movement [2], [14], the face due to articulation [4], [35] and other objects of interest [10]. This makes the problem difficult to solve because the shape of the object varies from frame to frame. Research has therefore expanded into Non-Rigid Structure from Motion (NRSfM).
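As an illustration of the rank-constrained factorization described above, the following minimal sketch factorizes a synthetic, noise-free measurement matrix with a truncated SVD. It is not the implementation of [31]: the function name `factorize_rigid`, the synthetic data and the even split of singular values between the two factors are assumptions made for this example, and the metric upgrade that enforces orthonormal rotation rows is omitted.

```python
import numpy as np

def factorize_rigid(W):
    """Rank-3 affine factorization of a 2F x P measurement matrix (hypothetical helper)."""
    # Subtracting the per-row centroid absorbs the 2D translation of each frame.
    t = W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W - t, full_matrices=False)
    # Rank constraint: keep only the three dominant singular values.
    sqrt_s = np.sqrt(s[:3])
    M = U[:, :3] * sqrt_s           # 2F x 3 motion factor (up to an affine ambiguity)
    S = sqrt_s[:, None] * Vt[:3]    # 3 x P shape factor (up to the same ambiguity)
    return M, S, t

# Synthetic rigid scene: P points seen in F orthographic views.
rng = np.random.default_rng(0)
F, P = 5, 20
S_true = rng.standard_normal((3, P))
rows = []
for _ in range(F):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    rows.append(Q[:2])              # two orthonormal rows per frame
R = np.vstack(rows)                 # 2F x 3
W = R @ S_true + rng.standard_normal((2 * F, 1))  # per-row translation

M, S, t = factorize_rigid(W)
residual = np.linalg.norm(W - (M @ S + t))
print(residual)
```

Because the centred, noise-free matrix has rank at most three, the truncated factors reproduce the measurements up to floating-point error; with noisy data the truncation acts as a least-squares rank-3 approximation.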

Most factorization-based SfM techniques begin with the assumption of an affine camera model, approximating the real projection with either weak perspective or paraperspective viewing conditions. NRSfM with a perspective camera can be seen as an extension of the classical reconstruction under orthographic projection. Perspective reconstruction has been successfully applied where the object model can be assumed rigid: Sturm and Triggs [27] described a non-iterative factorization method for uncalibrated cameras. Han and Kanade [15] proposed an alternative method using a bilinear projective factorization algorithm; this iteratively improves the depth information, eliminating the need to calculate fundamental matrices. An investigation of different camera models is presented in [16]. Using the full perspective camera model can indeed help obtain a correct 3D reconstruction of the object, but too many unknown variables lead to an under-constrained problem. However, the perspective projection model is unnecessary when the object is small relative to the distance between the camera centre and the object.

To extend rigid SfM to the recovery of 3D non-rigid objects [1], the seminal work of Bregler et al. [4] first described a low-rank shape model for varying shapes. They factorize the 2D data matrix into shape coefficients, a camera motion matrix and 3D basis shapes using SVD, a method similar to that proposed in [31]. Following this shape model, factorization for articulated NRSfM was proposed in [25], but small inaccuracies in the affine values obtained from the initial affine decomposition greatly affect the subsequent estimation process. Xiao et al. [35] proposed a closed-form solution and demonstrated an ambiguity in the orthonormality constraints: using only orthonormality constraints is insufficient to determine the estimated structures uniquely. They employ the traditional orthonormality constraints but also introduce additional constraints to further determine the shape basis; however, this method does not cope well with noisy data. To overcome this, iterative optimization methods based on bundle adjustment were introduced in [34] as a last step of reconstruction, in order to improve the quality of estimation. Recent approaches have focused on solving problems related to the inherently large number of degrees of freedom, which, together with motion degeneracy (very limited camera motion during data acquisition), may result in unusable reconstructions. As a time-varying object usually cannot deform arbitrarily, the object can be represented in either shape space [32] or trajectory space [14], [12] in order to reduce the dimensionality of the problem.
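The low-rank shape model can be made concrete with a small numerical sketch. Assuming each frame's shape is a linear combination of K basis shapes viewed through an orthographic camera, the stacked 2F x P measurement matrix has rank at most 3K. The sizes, random data and variable names below are assumptions chosen for this example, and translation is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
F, P, K = 30, 40, 3                       # frames, points, basis shapes (illustrative)

basis = rng.standard_normal((K, 3, P))    # K basis shapes, each 3 x P
coeffs = rng.standard_normal((F, K))      # per-frame mixing coefficients

rows = []
for f in range(F):
    # Shape in frame f: weighted sum of the basis shapes (3 x P).
    shape_f = np.tensordot(coeffs[f], basis, axes=1)
    # Orthographic camera: first two rows of a random orthonormal matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    rows.append(Q[:2] @ shape_f)

W = np.vstack(rows)                       # 2F x P measurement matrix
rank = np.linalg.matrix_rank(W)
print(rank, "<=", 3 * K)                  # the rank is bounded by 3K
```

The bound follows because W factors into a 2F x 3K matrix of scaled camera rows times a 3K x P matrix of stacked basis shapes, which is what makes an SVD-based factorization applicable to the non-rigid case.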

Prior knowledge of shape and motion, obtained through learning techniques, may eliminate the ambiguity and can be used to improve the accuracy of shape and motion recovery. Additional constraints such as prior scene information [16], [36] can be learned in advance. Bartoli et al. [3] introduced a coarse-to-fine low-rank shape model: they ordered the basis shapes, starting from a mean shape, and iteratively added deformation modes. Recently, Zhou et al. [38] proposed a method that operates in the presence of nonlinear motion and non-Gaussian distributions, using a Markov chain Monte Carlo technique to minimize the residuals of the estimated shapes. An alternative bundle adjustment approach demonstrated by Del Bue introduces prior information on the object shape [5]. This optimization approach may improve performance for both non-rigid and articulated SfM, obtaining reliable 3D reconstructions when an appropriate initial value is provided. In particular, it overcomes the problems caused by strong degeneracy within the input image sequence. In practice, however, when constrained only by minimization of the 2D re-projection error and a single basis shape, the optimization often converges to a local minimum, because the large number of variables requires a high-quality initial guess.

Although tremendous progress has been made on SfM for both rigid and deformable shapes, the main limitation of most extant works is that they only address off-line (batch) computation. The downside of batch methods is that reconstruction can only start once all measurement data have been collected. To extend batch mode to online (recursive) operation, Morita and Kanade [21] first presented a sequential factorization method that treats the feature positions as a vector time series and updates only the first three eigenvectors instead of computing a full singular value decomposition. Subsequent research on sequential shape and motion recovery includes Mouragnon et al. [22], who demonstrated a generic, incremental method that minimizes an angular error between rays. Similarly, the authors of [9] added a smoothing penalty on the camera trajectory, updating the structure accordingly as new views are added. Real-time SfM solutions can be classified as filter-based [26], [7] or keyframe-based [18] optimization, and both have proven successful. These methods motivate real-time implementations, which nevertheless have so far dealt only with rigid objects or static environments. As yet, only a limited number of works have been published on online deformable structure recovery. Most recently, Paladini et al. [24] made progress on this problem, proposing a rank-growing system which updates the current shape model when the 2D re-projection error exceeds an expected value. This technique makes online NRSfM more tractable, but whilst the higher number of degrees of freedom may lead to a smaller re-projection error, it can also result in unrealistic reconstructed shapes that are unrepresentative of the true object. Nor does the method address the self-occlusion problem: measurements are assumed to be complete, which is rarely valid.

Section snippets

Paper contributions

In this paper, an incremental approach is proposed to recursively reconstruct 3D deformable structures, such as an articulated face, from 2D video sequences taken by an orthographic camera. The first contribution is an alternative multi-part cost function, formed by superposition of specific constraints on both the basis shapes and their coefficients, within the state-of-the-art batch-processing scheme. The advantage of this approach is that the proposed prior knowledge reduces the likelihood of a

Problem statement

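Given a point in the world coordinate system, denoted as $s_n = [x_n, y_n, z_n]^T$ and transformed into the $m$th image coordinate system through rotation $R_m$ and translation $t_m$, its orthographic projection $x_{mn}$ onto the $m$th image is given by:

\[
x_{mn} = \begin{bmatrix} u_{mn} \\ v_{mn} \end{bmatrix}
= \begin{bmatrix} R_m \mid t_m \end{bmatrix} \cdot \begin{bmatrix} s_n \\ 1 \end{bmatrix}
= \begin{bmatrix} r_{m1} & r_{m2} & r_{m3} & t_{xm} \\ r_{m4} & r_{m5} & r_{m6} & t_{ym} \end{bmatrix} \cdot \begin{bmatrix} x_n & y_n & z_n & 1 \end{bmatrix}^T
\]

where $x_{mn}$ represents the $n$th 3D point $s_n$ projected onto the $m$th image; the orthographic camera matrix $R_m$ only encodes the first two rows of the rotation matrix, with rotation constraint $R_m R_m^T = I$. It is shown in [31] that when $x_{mn}$ are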
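A minimal numerical sketch of the orthographic projection defined in this section: the camera keeps only the first two rows of a full rotation matrix and adds a 2D translation. The function name `project_orthographic`, the 30 degree rotation about the y axis and the sample points are assumptions chosen for this example.

```python
import numpy as np

def project_orthographic(R_full, t, S):
    """Project 3 x N world points S to 2 x N image points (hypothetical helper)."""
    R = R_full[:2]                 # 2 x 3 orthographic camera matrix
    return R @ S + t[:, None]      # add the 2D translation to every point

theta = np.deg2rad(30.0)
R_full = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t = np.array([0.5, -1.0])

# Columns: the world origin and the unit points on the x and y axes.
S = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
x = project_orthographic(R_full, t, S)

# The truncated rotation still has orthonormal rows.
print(np.allclose(R_full[:2] @ R_full[:2].T, np.eye(2)))
```

The world origin projects onto the translation itself, which is why centring the measurements removes the translation before factorization.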

Prior information learning

As mentioned in the previous section, the results obtained without using any prior information about shape and/or trajectory are sensitive to the level of noise present in the data and to the algorithm initialisation. A greater number of degrees of freedom may lead to a smaller re-projection error but result in unrealistic reconstructed shapes. Appropriate prior shape information can help to improve the accuracy of motion and shape recovery. The key idea in our method is to use a learned shape space

Recursive NRSfM

For the recursive NRSfM, the shape is divided into off-line and online components: $S_t = S_t^{\mathrm{off}} + S_t^{\mathrm{on}}$. The off-line part is mainly used to indicate the static overall shape and the online part is responsible for representing the dynamic shape changes. The method described in the preceding section was used to estimate the off-line shape $S_t^{\mathrm{off}}$ with the prior information about the shapes and the probability distribution of the weights learned in advance using a standard PCA technique on a training database of
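The abstract names Incremental PCA as the mechanism for updating the shape model online. The sketch below shows one common form of such an update, in which a single new sample is folded into an existing mean and basis through a small SVD with a mean-correction column. It is a generic illustration and not necessarily the variant used in the paper; the function name `ipca_update`, the `max_rank` parameter and the random data are assumptions.

```python
import numpy as np

def ipca_update(mean, U, s, n, x, max_rank):
    """Fold one new sample x into a PCA model (mean, U, s) built from n samples."""
    delta = x - mean
    # Mean-correction column: accounts for the shift of the mean when x is added.
    extra = np.sqrt(n / (n + 1.0)) * delta
    # The SVD of [U diag(s) | extra] yields the updated basis and singular values.
    A = np.column_stack([U * s, extra])
    U_new, s_new, _ = np.linalg.svd(A, full_matrices=False)
    new_mean = mean + delta / (n + 1.0)
    # Truncating to max_rank bounds the model size; the raw sample can then be discarded.
    return new_mean, U_new[:, :max_rank], s_new[:max_rank], n + 1

rng = np.random.default_rng(2)
d, n_samples = 6, 50
X = rng.standard_normal((n_samples, d))   # one vectorized "shape" per row

# Start from the first sample with an empty basis, then stream the rest.
mean, U, s, n = X[0].copy(), np.zeros((d, 0)), np.zeros(0), 1
for x in X[1:]:
    mean, U, s, n = ipca_update(mean, U, s, n, x, max_rank=d)

# Reference: batch PCA on all the data at once.
s_batch = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.allclose(s, s_batch))
```

With `max_rank` equal to the data dimension nothing is discarded, so the streamed model agrees with batch PCA; choosing a smaller `max_rank` trades accuracy for a fixed-size model.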

NRSfM with missing data

The two algorithms proposed above assume that the measurement matrix W is complete, with all the feature points detected in all the images. This is unlikely to happen in practice, as some of the feature points will not be detected in every image, either because of feature point detection problems or because some parts of the 3D object are not visible from all camera positions. As a result, some of the entries in the measurement matrix W may be unknown. This section describes a
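One simple way to work with an incomplete measurement matrix is to carry a boolean visibility mask alongside W and to evaluate the re-projection error over observed entries only. The sketch below illustrates just that bookkeeping, not the missing-data algorithm of this paper; the function name `masked_reprojection_rmse`, the NaN encoding of unknown entries and the random data are assumptions.

```python
import numpy as np

def masked_reprojection_rmse(W, observed, M, S):
    """RMS re-projection error of the model M @ S over observed entries of W only."""
    # Unknown entries (NaN in W) are replaced by a zero residual and not counted.
    residual = np.where(observed, W - M @ S, 0.0)
    return np.sqrt(np.sum(residual ** 2) / np.count_nonzero(observed))

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 3))      # motion factor for 4 frames (2 rows each)
S = rng.standard_normal((3, 10))     # 10 points
W = M @ S

# Hide roughly 30% of the entries to mimic undetected or occluded features.
observed = rng.random(W.shape) > 0.3
observed[0, 0] = True                # keep one known entry for the demo below
W[~observed] = np.nan

err_exact = masked_reprojection_rmse(W, observed, M, S)

# Perturbing a single observed entry by 1 changes the error accordingly.
W[0, 0] += 1.0
err_perturbed = masked_reprojection_rmse(W, observed, M, S)
print(err_exact, err_perturbed)
```

Normalizing by the number of observed entries keeps the error comparable between frames with different amounts of occlusion.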

Results and discussion

The experiments to evaluate the proposed methodology were based on batch formulation and sequential recovery of an articulated face and human motion. In the case of reconstruction of objects undergoing only small deformations, the estimated shape can be accurately represented using a model with a relatively small number of degrees of freedom, thereby allowing for linear deformations. To demonstrate the performance of the proposed methods, extensive experimental evaluation has been provided. We

Conclusions and future work

We have presented a new approach to solve the recursive NRSfM problem and have demonstrated the accuracy and robustness of our method on a series of challenging situations. Our method successfully recovers shape and camera motion parameters as new frames arrive; additionally it allows for updates to the model, thus accounting for new shape variations as objects deform over the sequence. We have also developed several extensions to the algorithm for deformable object recovery, which include use

Acknowledgment

The “drink” motion capture data used in this project was obtained from mocap.cs.cmu.edu.

References (38)

  • E. Eade et al.

    Monocular slam as a graph of coalesced observations

    Int. Conf. Comput. Vis.

    (2007)
  • A. Eriksson et al.

    Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l1 norm

    Comput. Vis. Pattern Recognit.

    (2010)
  • M. Farenzena et al.

    Efficient camera smoothing sequential structure-from-motion using approximate cross-validation

    Eur. Conf. Comput. Vis.

    (2008)
  • J. Fayad et al.

    Non-rigid structure from motion using quadratic deformation models

    Br. Mach. Vis. Conf.

    (2009)
  • J. Fortuna et al.

    Rigid structure from motion from a blind source separation perspective

    Int. J. Comput. Vis.

    (2010)
  • P. Gotardo et al.

    Computing smooth time-trajectories for camera and deformable shape in structure from motion with occlusion

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • P. Gotardo et al.

    Kernel non-rigid structure from motion

    Int. Conf. Comput. Vis.

    (2011)
  • P. Gotardo et al.

    Non-rigid structure from motion with complementary rank-3 spaces

    Comput. Vis. Pattern Recognit.

    (2011)
  • M. Han et al.

    Multiple motion scene reconstruction from uncalibrated views

    Int. Conf. Comput. Vis.

    (2001)