Abstract
In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded to train a statistical hidden Markov model (HMM) of lips movement. The HMM is then used to generate, in the maximum-probability sense, the dynamic trajectory of lips movement for given speech signals. The generated trajectory serves as a guide for selecting, from the original training database, an optimal sequence of lips images, which are then stitched back onto a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. In contrast to traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by a heuristic method that finds the optimal state alignment and a probabilistic descent algorithm that optimizes the model parameters under the MGE criterion. In objective evaluations, the MGE-based method consistently outperforms the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. From as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with given speech signals, whether natural or TTS-synthesized. The system won first place in the audio-visual consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
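The trajectory-guided selection step described above can be viewed as a Viterbi-style search over candidate lip images: each frame's pick should stay close to the HMM-generated guide trajectory (target cost) while remaining smooth against the previous pick (concatenation cost). The following is a minimal sketch of that idea only; the Euclidean costs, the single weight `w_concat`, and the function name are illustrative assumptions, not the paper's exact cost formulation.

```python
import numpy as np

def select_samples(trajectory, candidates, w_concat=1.0):
    """Viterbi-style search: pick one candidate feature vector per frame,
    minimizing distance to the guide trajectory (target cost) plus a
    weighted concatenation cost between consecutive picks."""
    T, K = len(trajectory), len(candidates)
    # target cost: distance from each candidate to each trajectory frame, shape (T, K)
    target = np.linalg.norm(trajectory[:, None, :] - candidates[None, :, :], axis=2)
    # concatenation cost between every pair of candidates, shape (K, K)
    concat = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=2)

    cost = target[0].copy()              # best cumulative cost ending at each candidate
    back = np.zeros((T, K), dtype=int)   # best predecessor for each (frame, candidate)
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat   # (prev, cur) transition costs
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(K)] + target[t]

    # backtrack the optimal candidate sequence
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With `w_concat = 0` the search reduces to per-frame nearest-neighbour selection; raising it favours sequences of mutually similar images, trading trajectory fidelity for visual smoothness.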
Cite this article
Wang, L., Soong, F.K. HMM trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74, 9849–9869 (2015). https://doi.org/10.1007/s11042-014-2118-8