HMM trajectory-guided sample selection for photo-realistic talking head


Abstract

In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a person is first recorded and used to train a statistical Hidden Markov Model (HMM) of lip movement. The HMM is then used to generate, in the maximum-probability sense, the dynamic trajectory of lip movement for a given speech signal. The generated trajectory serves as a guide for selecting, from the original training database, an optimal sequence of lip images, which are then stitched back onto a background head video. We also propose a minimum generation error (MGE) training method to refine the audio-visual HMM and improve visual speech trajectory synthesis. In contrast to traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by a heuristic method that finds the optimal state alignment and a probabilistic descent algorithm that optimizes the model parameters under the MGE criterion. In objective evaluations, the proposed MGE-based method achieves consistent improvement over the ML-based method in mean square error reduction, correlation increase, and recovery of global variance. With as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with given speech signals, whether natural or TTS-synthesized. The system won first place in the A/V consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
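The trajectory-guided selection step can be viewed as a lattice search over candidate lip images: each video frame must match the HMM-generated visual feature trajectory (target cost) while consecutive selected images remain smooth (concatenation cost). The sketch below is a minimal Python/NumPy illustration of that idea, not the authors' implementation; it assumes each lip image is represented by a feature vector (e.g., PCA coefficients), uses plain Euclidean distances for both costs, and the function name and weights (w_target, w_concat) are hypothetical.

```python
import numpy as np

def select_sample_sequence(trajectory, library, w_target=1.0, w_concat=1.0):
    """Trajectory-guided sample selection (illustrative sketch).

    trajectory : (T, D) array of HMM-generated visual feature vectors,
                 one per output video frame.
    library    : (N, D) array of feature vectors for the N lip images
                 in the recorded training database.
    Returns a list of T library indices, one selected image per frame.
    """
    T, _ = trajectory.shape
    N, _ = library.shape

    # Target cost: distance of every library sample to every trajectory frame (T, N).
    target = np.linalg.norm(trajectory[:, None, :] - library[None, :, :], axis=-1)

    # Concatenation cost: distance between any two library samples (N, N),
    # penalizing abrupt changes between consecutive selected images.
    concat = np.linalg.norm(library[:, None, :] - library[None, :, :], axis=-1)

    # Viterbi search over the frame-by-frame candidate lattice.
    cost = w_target * target[0]            # best cumulative cost ending at each sample
    back = np.zeros((T, N), dtype=int)     # back-pointers
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat   # (prev sample, current sample)
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)] + w_target * target[t]

    # Back-trace the lowest-cost path.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In this sketch T is simply the number of frames in the generated trajectory; compositing the selected lip images back onto the background head video, as described in the paper, is not shown.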




Author information


Corresponding author

Correspondence to Lijuan Wang.


About this article


Cite this article

Wang, L., Soong, F.K. HMM trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74, 9849–9869 (2015). https://doi.org/10.1007/s11042-014-2118-8

