Facial model adaptation from a monocular image sequence using a textured polygonal model
Introduction
As the standardization of facial definition parameters (FDPs) and facial animation parameters (FAPs) has been approved by the ISO MPEG-4 committee [7], [30], talking-head applications such as the virtual newscaster [4] and the virtual seller [11] are entering the market. Facial models in these applications can be animated with facial animation parameters, which may be generated from scripts, speech analysis, or visual analysis in a model-based approach [2], [10]. Not only is 3D visualization provided; the low bandwidth requirement also makes such systems attractive for constructing multiuser virtual environments for video conferencing [9] and collaborative work [6], [24]. To provide a personalized talking head with animation capability, facial model adaptation is required to turn a generic animated facial model into a user-customized one.
In the literature, facial model adaptation algorithms can be categorized according to how the face images are acquired. The most accurate facial model is obtained from a 3D laser scanner with range and image sensors: both the outer structure and the appearance of the target object (e.g., the human head) can be derived precisely [1], [12], [23], [33], [41]. However, 3D scanners are very expensive and not affordable for ordinary users. An alternative is to use multiple calibrated cameras, from which range data can be derived from disparity together with the intrinsic and extrinsic parameters of each camera. The accuracy of this method depends strongly on the precision of the camera calibration, and usually more than two cameras are required for precise depth estimation; for example, Pedersini et al. [34] use three cameras and Pighin et al. [35] use five. Facial model adaptation is also possible when only one camera is available. Andrés del Valle and Ostermann [3] adapt the facial model with a single frontal face image: 18 feature points are manually extracted and used to deform the facial model by applying radial basis functions with compact support (RBFCS). Since no depth information is acquired from the frontal face image, the adapted model is only a coarse approximation, and texture mapping can only use the frontal view. This approach is therefore limited to applications where the head deviates only slightly from the frontal pose. Several other algorithms use two or three images from orthogonal views to estimate the 3D locations of a predefined set of facial feature points [8], [22], [29]. The facial model is deformed with these feature points, and the facial texture is generated by blending the frontal and lateral views. Since the deformation takes depth into consideration, the adaptation is more accurate than with a single image.
However, to achieve more accurate adaptation, more facial feature points are required, and manual assistance in feature extraction is often indispensable.
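The compactly supported radial basis function deformation mentioned above can be sketched as follows. The Wendland C2 kernel and the support radius are illustrative choices here, not necessarily those of the cited work: displacements prescribed at sparse feature points are interpolated to every mesh vertex, with each feature point influencing only vertices within its support.

```python
import numpy as np

def wendland_c2(r):
    """Compactly supported Wendland C2 kernel: zero for r >= 1."""
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)

def rbf_deform(vertices, feats, displacements, support=0.5):
    """Deform mesh vertices so that the feature points move by the
    given displacements, interpolating with compactly supported RBFs."""
    # Pairwise feature-point distances, scaled by the support radius.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1) / support
    A = wendland_c2(d)                      # interpolation matrix
    w = np.linalg.solve(A, displacements)   # one weight vector per axis
    dv = np.linalg.norm(vertices[:, None, :] - feats[None, :, :], axis=-1) / support
    return vertices + wendland_c2(dv) @ w   # displaced vertices
```

Because the kernel vanishes outside the support radius, moving one feature point leaves distant parts of the mesh untouched, which is what makes RBFCS attractive for local facial deformation.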
Recently, interest has grown in utilizing image sequences to achieve more flexible estimation [13], [15], [20], [27]. The image sequence contains a head performing translation and rotation with a neutral expression, so the head movement can be assumed to be rigid. Capturing an image sequence of a moving head with a fixed camera resembles capturing a static head with multiple uncalibrated cameras; furthermore, if the head pose in each frame is known, the problem reduces to facial model adaptation with multiple calibrated cameras. However, accurate pose information is only available with head-mounted sensors (e.g., magnetic trackers [39]). Facial model adaptation is therefore often resolved by integrating motion and structure estimation in a structure-from-motion (SfM) framework [16]. In [15], feature point locations and head motion are estimated with extended Kalman filtering. However, since the selected feature points are not adequate to determine the whole facial model structure, structure constraints are imposed with an eigenspace filter constructed from a set of facial models previously acquired with a 3D laser scanner. Although this effectively avoids abrupt deformations, it does not take individual differences into consideration. Other researchers use epipolar constraints to solve the SfM problem, but this may suffer from the small-baseline problem addressed in [16]. Liu et al. [27] match a large set of corner points and apply an error elimination mechanism to discard mismatches, which would otherwise strongly distort the estimation. Their algorithm depends heavily on the accuracy of corner matching and fails when the skin is too smooth for extracting and matching corner points. In [13], a robust facial model estimation algorithm based on regularized bundle adjustment is proposed.
An iteratively reweighted least-squares method reduces the influence of erroneous point matches, and a smoothness constraint is imposed as a regularization term to prevent excessive deformation.
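The iteratively reweighted least-squares idea can be illustrated in isolation. The sketch below applies Huber-style reweighting to a linear fit; the weight function and its threshold are illustrative choices, not taken from the cited work:

```python
import numpy as np

def irls(A, b, n_iter=20, delta=1.0):
    """Iteratively reweighted least squares with Huber weights:
    residuals larger than `delta` are down-weighted, so a few bad
    point matches cannot dominate the fit."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]   # ordinary LS start
    for _ in range(n_iter):
        r = A @ x - b
        absr = np.maximum(np.abs(r), 1e-12)    # guard against r == 0
        w = np.where(absr <= delta, 1.0, delta / absr)
        sw = np.sqrt(w)                        # reweight the rows
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x
```

On a line fit with one gross outlier, ordinary least squares is pulled far off while the reweighted estimate stays close to the inlier solution, which mirrors how erroneous point matches are suppressed in the bundle-adjustment setting.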
To summarize, existing algorithms for facial model adaptation from image sequences suffer from insufficient feature sets or imperfect feature matching, which may lead to imprecise head motion estimation. In this paper, we propose to resolve these difficulties by integrating facial model adaptation, texture mapping, and head pose estimation as cooperative and complementary processes. Using an analysis-by-synthesis approach, salient facial feature points and head profiles are reliably tracked and extracted to form a more complete feature set for model adaptation, and more robust head motion estimation is achieved with the assistance of the textured facial model. Fig. 1 illustrates the conceptual flow of the proposed scheme. An initial facial model is estimated from a frontal face image in the initialization stage. In the tracking stage, pose estimation, feature point tracking, and head profile extraction are performed with the assistance of a synthetic, textured facial model. As in other approaches with a rigid head motion assumption, the moving head in the image sequence should keep a neutral expression for reliable tracking. In the adaptation stage, the facial model structure is adapted according to the depth displacements of feature points and head profiles between the real face and the synthetic face. A complete textured facial model is thus obtained by applying the above procedures to successive frames.
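The overall flow just described can be summarized as a per-frame loop. The functions below are stubs with hypothetical names, intended only to show how the initialization, tracking, and adaptation stages cooperate; they are not the paper's implementation:

```python
def estimate_pose(model, frame):
    # Stub: in the actual scheme, the pose minimizes the difference
    # between the synthetic rendering and the real frame.
    return frame.get("pose", 0.0)

def track_features(model, frame, pose):
    # Stub for feature point tracking and head profile extraction,
    # both guided by the synthetic face rendered at `pose`.
    return frame.get("features", [])

def adapt_model(model, features, pose, frame):
    # Stub: adjust structure from depth displacements, then blend
    # the visible region of `frame` into the texture map.
    model["poses_seen"].append(pose)
    return model

def process_sequence(model, frames):
    """Run the tracking and adaptation stages over successive frames,
    starting from the initialized (frontal-view) textured model."""
    for frame in frames:
        pose = estimate_pose(model, frame)
        features = track_features(model, frame, pose)
        model = adapt_model(model, features, pose, frame)
    return model
```

The key point is the feedback: each stage consumes the textured model refined by the previous frame, so pose estimation, feature tracking, and adaptation improve one another over the sequence.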
The remainder of this paper is organized as follows. In Section 2, the initialization procedure is described. The tracking procedure including pose estimation, feature point tracking, and head profile extraction is detailed in Section 3. The adaptation algorithm is described in Section 4. Section 5 presents experiments tested on synthetic and real image sequences for performance verification. Finally, Section 6 concludes this paper.
Section snippets
Initialization
In the initialization stage, a coarse adaptation is performed by deforming the generic facial model according to the positions of several salient facial feature points on the frontal face. The model is then textured with the frontal-view image and acts as the starting point for finer adaptation with successive frames in the image sequence. In essence, any animated facial model composed of polygonal patches can be initialized in the proposed scheme, and here the facial model developed by the
Tracking
In the tracking phase, head pose tracking is first performed with the current textured facial model in an analysis-by-synthesis manner. Afterwards, feature point tracking and head profile extraction are carried out with the synthetic face image generated from the estimated head pose parameters. The reliable head pose information and the enlarged set of control points from the head profile contribute the most significant improvements to head model adaptation.
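The analysis-by-synthesis idea — render the textured model at a candidate pose and compare it against the real frame — can be illustrated with a toy one-parameter search. The "renderer" below merely rotates and projects vertices and stands in for the actual textured rendering; all names are illustrative:

```python
import numpy as np

def render_silhouette(vertices, yaw):
    """Toy stand-in for rendering: rotate the model about the vertical
    axis and project the vertices onto the image plane."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (vertices @ R.T)[:, :2]

def estimate_yaw(vertices, observed, candidates):
    """Analysis-by-synthesis pose search: pick the candidate pose whose
    synthetic projection best matches the observation (SSD criterion)."""
    errs = [np.sum((render_silhouette(vertices, y) - observed) ** 2)
            for y in candidates]
    return candidates[int(np.argmin(errs))]
```

In the full scheme the comparison is made on the textured rendering rather than bare projections, and the search runs over all six pose parameters, but the matching criterion works the same way.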
Adaptation
The adaptation stage is composed of model adaptation and texture update procedures. In model adaptation, the depth displacements of feature points and head profile points form a depth displacement field for adjusting all vertices of the facial model. Afterwards, texture from the visible region of the current face image is blended with the original texture to update the texture map.
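A minimal sketch of the two procedures: sparse depth displacements at control points are spread to all vertices with normalized distance weights (the Gaussian weight is an illustrative choice), and the visible region of the new view is alpha-blended into the texture map. Function names and parameters are hypothetical:

```python
import numpy as np

def displacement_field(vertices, controls, dz, sigma=0.3):
    """Spread sparse depth displacements `dz`, given at 2D control
    points, to every vertex via normalized Gaussian distance weights."""
    d2 = np.sum((vertices[:, None, :2] - controls[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-12   # normalize per vertex
    out = vertices.copy()
    out[:, 2] += w @ dz                          # adjust depth only
    return out

def update_texture(texture, new_view, visibility, alpha=0.5):
    """Blend the visible region of the current face image into the
    texture map; `visibility` is a per-texel mask in [0, 1], so hidden
    texels keep the old texture."""
    w = alpha * visibility
    return (1.0 - w) * texture + w * new_view
```

A vertex coinciding with an isolated control point receives that point's full displacement, while vertices between control points get a smooth interpolation, which is the qualitative behavior a depth displacement field needs.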
Experimental results
To verify the performance of the proposed model-adaptation approach, we conducted experiments on synthetic and real sequences. The improvement in feature point tracking when assisted by the textured facial model is also explored.
Conclusions
In this paper, we have presented a feasible technique for adapting a generic facial model to a user-customized model. By integrating model adaptation, texture mapping, and head pose estimation as cooperative and complementary processes, facial model adaptation can be accomplished with an image sequence of a shaking head acquired from a single uncalibrated camera. The customized facial models can be directly applied for entertainment and personal communications, such as avatars in
References (43)
- et al., Model-based analysis synthesis image coding (MBASIC) system for a person's face, Signal Processing: Image Communication (October 1989)
- et al., Fast head modeling for animation, Image Vision Comput. (March 2000)
- et al., LAFTER: A real-time lips and face tracker with facial expression recognition, Pattern Recognition (August 2000)
- Object-based analysis-synthesis coding based on the source model of moving rigid 3D objects, Signal Processing: Image Communication (May 1994)
- et al., A visual analysis/synthesis feedback loop for accurate face tracking, Signal Processing: Image Communication (February 2001)
- et al., MPEG-4 facial animation technology: Survey, implementation, and results, IEEE Trans. Circuits Systems Video Technol. (March 1999)
- A.C. Andrés del Valle, J. Ostermann, 3D talking head customization by adapting a generic model to one uncalibrated...
- Annanova Ltd. web page:...
- et al., 3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequence, IEEE Trans. Circuits Systems Video Technol. (June 1994)
- et al., Virtual human representation and communication in VLNet, IEEE Comput. Graphics Appl. (March-April 1997)
- Analysis and synthesis of facial image sequences in model-based image coding, IEEE Trans. Circuits Systems Video Technol.
- Model-aided coding: a new approach to incorporate facial animation into motion-compensated video coding, IEEE Trans. Circuits Systems Video Technol.
- Regularized bundle-adjustment to model heads from image sequences without calibration data, Internat. J. Comput. Vision
- 3D structure from 2D motion, IEEE Signal Processing Mag.
Cited by (3)
- An ellipsoidal model for generating realistic 3D facial textures, International Journal of Computer Applications in Technology, 2013
- Novel 3D model reconstruction method for the coexistence of diffuse and specular object surfaces, Proceedings of the 4th IASTED International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA 2007), 2007
- Transferable videorealistic speech animation, Computer Animation, Conference Proceedings, 2005