Facial model adaptation from a monocular image sequence using a textured polygonal model

https://doi.org/10.1016/S0923-5965(02)00008-5

Abstract

Although several algorithms have been proposed for facial model adaptation from image sequences, an insufficient feature set for adapting a full facial model, imperfect matching of feature points, and imprecise head motion estimation may degrade the accuracy of model adaptation. In this paper, we propose to resolve these difficulties by integrating facial model adaptation, texture mapping, and head pose estimation as cooperative and complementary processes. By using an analysis-by-synthesis approach, salient facial feature points and head profiles are reliably tracked and extracted to form a growing and more complete feature set for model adaptation. More robust head motion estimation is achieved with the assistance of the textured facial model. The proposed scheme operates on image sequences acquired with a single uncalibrated camera and requires only minor manual adjustment during initialization, which makes it a feasible approach for facial model adaptation.

Introduction

As the standardization of facial definition parameters (FDPs) and facial animation parameters (FAPs) has been approved by the ISO MPEG-4 committee [7], [30], talking-head applications such as the virtual newscaster [4] and the virtual seller [11] are entering the market. Facial models in these applications can be animated with facial animation parameters, which may be generated from scripts, speech analysis, or visual analysis in a model-based approach [2], [10]. Not only is 3D visualization provided; the low bandwidth requirement is also attractive for constructing multiuser virtual environments for video conferencing [9] and collaborative work [6], [24]. To provide a personalized talking head with animation capability, facial model adaptation is required to adapt a generic animated facial model to a user-customized one.

In the literature, facial model adaptation algorithms can be categorized according to how the face images are acquired. The most accurate facial model is obtained from a 3D laser scanner with range and image sensors: both the outer structure and the appearance of the target object (e.g., the human head) can be derived precisely [1], [12], [23], [33], [41]. However, 3D scanners are very expensive and not affordable for ordinary users. An alternative is to use multiple calibrated cameras, with which the range data can be derived from disparity together with the intrinsic and extrinsic parameters of each camera. The accuracy of this method depends strongly on the precision of the camera calibration, and usually more than two cameras are required for precise estimation of depth; for example, Pedersini et al. [34] use three cameras and Pighin et al. [35] use five. Facial model adaptation is also possible when only one camera is available. Andrés del Valle and Ostermann [3] adapt the facial model with a single frontal face image: a total of 18 feature points are manually extracted to deform the facial model by applying radial basis functions with compact support (RBFCS). Since no depth information is acquired from the frontal face image, the adapted model is only a coarse approximation and texture mapping can use only the frontal face image; this approach is therefore restricted to applications where the head motion deviates little from the frontal view. Several other algorithms use two or three images from orthogonal views to estimate the 3D locations of a predefined set of facial feature points [8], [22], [29]. The facial model is deformed with these feature points, and the facial texture is generated by blending the frontal and lateral views. Since the deformation takes depth into consideration, the adaptation is more accurate than using one image only. However, more accurate adaptation requires more facial feature points, and manual assistance in feature extraction is often indispensable.
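To make the RBFCS idea concrete, the following is a minimal sketch of deforming a mesh with compactly supported radial basis functions. The Wendland C2 kernel, the support radius, and the array layout are our own illustrative assumptions; [3] specifies only that RBFs with compact support are used.

```python
import numpy as np

def wendland_c2(r, support):
    """Wendland C2 compactly supported RBF: nonzero only for r < support."""
    t = np.clip(r / support, 0.0, 1.0)
    return (1.0 - t) ** 4 * (4.0 * t + 1.0)

def rbf_deform(vertices, feature_pts, displacements, support=0.3):
    """Propagate sparse feature-point displacements to all mesh vertices.

    vertices:      (V, 3) generic model vertex positions
    feature_pts:   (F, 3) positions of matched feature points on the model
    displacements: (F, 3) target minus model position for each feature point
    """
    # Solve for RBF weights so the interpolant reproduces the displacements
    # exactly at the feature points.
    r_ff = np.linalg.norm(feature_pts[:, None, :] - feature_pts[None, :, :], axis=-1)
    A = wendland_c2(r_ff, support)                # (F, F) kernel matrix
    weights = np.linalg.solve(A, displacements)   # (F, 3)

    # Evaluate the interpolant at every vertex; vertices outside the support
    # radius of all feature points remain untouched.
    r_vf = np.linalg.norm(vertices[:, None, :] - feature_pts[None, :, :], axis=-1)
    return vertices + wendland_c2(r_vf, support) @ weights
```

Because the kernel vanishes beyond the support radius, each feature point influences only nearby vertices, which keeps the deformation local.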

Recently, increasing interest has arisen in utilizing image sequences to achieve more flexible estimation [13], [15], [20], [27]. The image sequence contains a moving head performing translation and rotation with a neutral expression, so the head movement can be assumed to be rigid. Capturing an image sequence of a moving head with a fixed camera resembles capturing a steady head with multiple uncalibrated cameras. Furthermore, if the head pose in each frame is known, the problem can be treated as facial model adaptation with multiple calibrated cameras. However, accurate pose information is only available with head-mounted sensors (e.g., magnetic trackers [39]). Therefore, facial model adaptation is often resolved by integrating motion and structure estimation in a structure-from-motion (SfM) framework [16]. In [15], the feature point locations and head motion are estimated with extended Kalman filtering. However, since the selected feature points are not adequate to determine the whole facial model structure, structure constraints are imposed with an eigenspace filter constructed from a set of facial models previously captured by a 3D laser scanner. Although this effectively avoids abrupt deformations, it does not take individual differences into consideration. Other researchers use epipolar constraints to solve the SfM problem, but this may suffer from the small-baseline problem addressed in [16]. Liu et al. [27] match a large set of corner points and apply an error elimination mechanism to discard mismatches, which would otherwise strongly bias the estimation. Their algorithm depends heavily on the accuracy of corner matching and fails in cases where the skin is too smooth for corner points to be extracted and matched. In [13], a robust facial model estimation algorithm based on regularized bundle adjustment is proposed: an iteratively reweighted least-squares method reduces the influence of erroneous point matches (see the sketch below), and a smoothness constraint is imposed as a regularization term to prevent excessive deformation.
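As an illustration of the robust estimation idea in [13], here is a minimal iteratively reweighted least-squares sketch for a linear system. The Huber weighting function and the linear model are illustrative assumptions, not the exact formulation of the regularized bundle adjustment in [13].

```python
import numpy as np

def irls(A, b, iters=10, delta=1.0):
    """Robustly solve A x ≈ b by iteratively reweighted least squares.

    Residuals larger than `delta` receive Huber weights < 1, so grossly
    erroneous point matches contribute little to the next estimate.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]     # ordinary LS starting point
    for _ in range(iters):
        r = A @ x - b
        # Huber weights: 1 for inliers, delta/|r| for large residuals
        w = np.where(np.abs(r) <= delta, 1.0,
                     delta / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)                          # weighted LS via row scaling
        x = np.linalg.lstsq(A * sw[:, None], sw * b, rcond=None)[0]
    return x
```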

To summarize, existing algorithms for facial model adaptation from image sequences suffer from an insufficient feature set or imperfect feature matching, which may lead to imprecise head motion estimation. In this paper, we propose to resolve these difficulties by integrating facial model adaptation, texture mapping, and head pose estimation as cooperative and complementary processes. By using an analysis-by-synthesis approach, salient facial feature points and head profiles are reliably tracked and extracted to form a more complete feature set for model adaptation. More robust head motion estimation is achieved with the assistance of the textured facial model. Fig. 1 illustrates the conceptual flow of the proposed scheme. An initial facial model is estimated from a frontal face image in the initialization stage. In the tracking stage, pose estimation, feature point tracking, and head profile extraction are performed with the assistance of a synthetic, textured facial model. As in other approaches with a rigid head motion assumption, the moving head in the image sequence should maintain a neutral expression for reliable tracking. Afterwards, in the adaptation stage, the facial model structure is adapted according to the depth displacements of feature points and head profiles between the real face and the synthetic face. A complete textured facial model is thus obtained by applying the above procedures over successive frames.
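Read as code, the flow of Fig. 1 amounts to the per-frame loop sketched below. Every stage function here is a caller-supplied placeholder standing in for the procedures of Sections 2-4; only the control flow is taken from the paper.

```python
def adapt_from_sequence(frames, generic_model, *, initialize, estimate_pose,
                        render, track_features, extract_profile,
                        adapt_structure, update_texture):
    """Per-frame loop of the proposed scheme (after Fig. 1).

    The stage functions are hypothetical placeholders for the paper's
    initialization (Section 2), tracking (Section 3), and adaptation
    (Section 4) procedures.
    """
    model, texture = initialize(generic_model, frames[0])  # coarse fit + frontal texture
    for frame in frames[1:]:
        pose = estimate_pose(model, texture, frame)        # analysis-by-synthesis
        synthetic = render(model, texture, pose)           # synthetic face at that pose
        features = track_features(synthetic, frame)        # salient feature points
        profile = extract_profile(synthetic, frame)        # head profile contours
        model = adapt_structure(model, pose, features, profile)
        texture = update_texture(texture, model, pose, frame)
    return model, texture
```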

The remainder of this paper is organized as follows. In Section 2, the initialization procedure is described. The tracking procedure including pose estimation, feature point tracking, and head profile extraction is detailed in Section 3. The adaptation algorithm is described in Section 4. Section 5 presents experiments tested on synthetic and real image sequences for performance verification. Finally, Section 6 concludes this paper.

Section snippets

Initialization

In the initialization stage, a coarse adaptation is performed by deforming the generic facial model according to the positions of several salient facial feature points on the frontal face. The model is then textured with the frontal-view image and acts as the starting point for finer adaptation over successive frames of the image sequence. In essence, any animated facial model composed of polygonal patches can be initialized in the proposed scheme, and here the facial model developed by the…
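As a small illustration of texturing the model with the frontal-view image, the sketch below assigns texture coordinates by projecting each vertex into that image with an assumed pinhole camera. Since the camera is uncalibrated, the focal length f here is a rough guess for illustration rather than a value from the paper.

```python
import numpy as np

def frontal_texture_uvs(vertices, image_w, image_h, f=500.0):
    """Assign UVs by projecting model vertices into the frontal-view image.

    vertices: (V, 3) model in camera coordinates, z > 0 toward the image plane
    Returns (V, 2) texture coordinates normalized to [0, 1].
    """
    # Pinhole projection with the principal point at the image center
    u = f * vertices[:, 0] / vertices[:, 2] + image_w / 2.0
    v = f * vertices[:, 1] / vertices[:, 2] + image_h / 2.0
    return np.stack([u / image_w, v / image_h], axis=-1)
```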

Tracking

In the tracking phase, head pose tracking is first performed with the current textured facial model in an analysis-by-synthesis manner. Afterwards, feature point tracking and head profile extraction are carried out on the synthetic face image generated with the estimated head pose parameters. The reliable head pose information and the enlarged set of control points from the head profile contribute the most significant improvements to head model adaptation.
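The paper estimates pose by analysis-by-synthesis, minimizing the discrepancy between the synthetic and real face images. As a simplified geometric stand-in (not the authors' method), the sketch below fits a 6-DOF rigid pose to tracked 2D feature points by Gauss-Newton with a numerical Jacobian; the pinhole projection, focal length, and initial pose are illustrative assumptions.

```python
import numpy as np

def project(points, pose, f):
    """Pinhole projection of 3D points after a rigid transform.
    pose = (rx, ry, rz, tx, ty, tz), rotations composed as Rz @ Ry @ Rx."""
    rx, ry, rz, tx, ty, tz = pose
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    R = (np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
         @ np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
         @ np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]]))
    P = points @ R.T + np.array([tx, ty, tz])
    return f * P[:, :2] / P[:, 2:3]

def estimate_rigid_pose(model_pts, image_pts, f=500.0, iters=20, eps=1e-6):
    """Gauss-Newton fit of a 6-DOF head pose to 2D feature observations."""
    pose = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 5.0])  # start in front of camera
    for _ in range(iters):
        r = (project(model_pts, pose, f) - image_pts).ravel()
        J = np.empty((r.size, 6))
        for k in range(6):                            # numerical Jacobian
            d = np.zeros(6)
            d[k] = eps
            J[:, k] = ((project(model_pts, pose + d, f) - image_pts).ravel() - r) / eps
        pose = pose - np.linalg.lstsq(J, r, rcond=None)[0]
    return pose
```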

Adaptation

The adaptation stage is composed of model adaptation and texture update procedures. In model adaptation, the depth displacements of feature points and head profile points form a depth displacement field used to adjust all vertices of the facial model. Afterwards, texture from the visible region of the current face image is blended with the original texture to update the texture map; a sketch of the blending step follows.
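A minimal blending sketch: texels seen in the current frame are folded into the existing map with a visibility weight. The per-texel visibility map and the global factor alpha are our illustrative assumptions, not the paper's exact blending weights.

```python
import numpy as np

def update_texture_map(old_tex, frame_tex, visibility, alpha=0.5):
    """Blend texture captured from the current frame into the texture map.

    old_tex, frame_tex: (H, W, 3) maps in texture space
    visibility:         (H, W) weight in [0, 1]; 1 where the current frame
                        actually sees the corresponding texel, 0 elsewhere
    alpha:              assumed global blending factor
    """
    w = alpha * visibility[..., None]   # invisible texels keep the old value
    return (1.0 - w) * old_tex + w * frame_tex
```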

Experimental results

To verify the performance of the proposed model-adaptation approach, we conducted experiments on synthetic and real sequences. The improvement in feature point tracking achieved with the assistance of the textured facial model is also examined.

Conclusions

In this paper, we have presented a feasible technique for adapting a generic facial model to a user-customized model. By integrating model adaptation, texture mapping, and head pose estimation as cooperative and complementary processes, facial model adaptation can be accomplished with an image sequence of a shaking head acquired from a single uncalibrated camera. The customized facial models can be directly applied to entertainment and personal communications, such as avatars in…

References (43)

  • T.K. Capin, E. Petajan, J. Ostermann, Efficient modeling of virtual humans in MPEG-4, in: Proceedings of the 2000 IEEE...
  • Y.J. Chang, C.C. Chen, J.C. Chou, Y.C. Chen, Virtual talk: A model-based virtual phone using a layered audio-visual...
  • Y.J. Chang, C.C. Chen, J.C. Chou, Y.C. Chen, Development of a multi-user virtual conference system using a layered...
  • C.S. Choi et al., Analysis and synthesis of facial image sequences in model-based image coding, IEEE Trans. Circuits Systems Video Technol. (June 1994)
  • Cyberlink TalkingShow, web page:...
  • P. Eisert et al., Model-aided coding: A new approach to incorporate facial animation into motion-compensated video coding, IEEE Trans. Circuits Systems Video Technol. (April 2000)
  • P. Fua, Regularized bundle-adjustment to model heads from image sequences without calibration data, Internat. J. Comput. Vision (July 2000)
  • T. Horprasert, Y. Yacoob, L. Davis, Computing 3-D head orientation from monocular image sequence, in: Proceedings of...
  • T.S. Jebara, A. Pentland, Parameterized structure from motion for 3D adaptive feedback tracking of faces, in:...
  • T. Jebara et al., 3D structure from 2D motion, IEEE Signal Processing Mag. (May 1999)
  • M. Kampmann, Estimation of the chin and cheek contours for precise face model adaptation, in: Proceedings of the 1997...