Abstract
In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded to train a statistical hidden Markov model (HMM) of lips movement. The HMM is then used to generate, in the maximum-probability sense, the dynamic trajectory of lips movement for given speech signals. The generated trajectory serves as a guide for selecting, from the original training database, an optimal sequence of lips images, which are then stitched back onto a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. In contrast to traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by a heuristic method that finds the optimal state alignment and a probabilistic descent algorithm that optimizes the model parameters under the MGE criterion. In objective evaluations, the MGE-based method consistently outperforms the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. From as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with given speech signals, whether natural or TTS-synthesized. The system won first place in the audio-visual consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
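The trajectory-guided selection step described above can be viewed as a Viterbi-style search over candidate lip images: each frame's pick should stay close to the HMM-generated guide trajectory (target cost) while remaining smooth against the previous pick (concatenation cost). The following is a minimal sketch of that idea only; the Euclidean costs, the single weight `w_concat`, and the function name are illustrative assumptions, not the paper's exact cost formulation.

```python
import numpy as np

def select_samples(trajectory, candidates, w_concat=1.0):
    """Viterbi-style search: pick one candidate feature vector per frame,
    minimizing distance to the guide trajectory (target cost) plus a
    weighted concatenation cost between consecutive picks."""
    T, K = len(trajectory), len(candidates)
    # target cost: distance from each candidate to each trajectory frame, shape (T, K)
    target = np.linalg.norm(trajectory[:, None, :] - candidates[None, :, :], axis=2)
    # concatenation cost between every pair of candidates, shape (K, K)
    concat = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=2)

    cost = target[0].copy()              # best cumulative cost ending at each candidate
    back = np.zeros((T, K), dtype=int)   # best predecessor for each (frame, candidate)
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat   # (prev, cur) transition costs
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(K)] + target[t]

    # backtrack the optimal candidate sequence
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With `w_concat = 0` the search reduces to per-frame nearest-neighbour selection; raising it favours sequences of mutually similar images, trading trajectory fidelity for visual smoothness.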
Cite this article
Wang, L., Soong, F.K. HMM trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74, 9849–9869 (2015). https://doi.org/10.1007/s11042-014-2118-8