
A statistical parametric approach to video-realistic text-driven talking avatar

Multimedia Tools and Applications

Abstract

This paper proposes a statistical parametric approach to video-realistic text-driven talking avatars. We follow the trajectory HMM framework, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing either simple geometric mouth shapes or video-realistic lip motion. Our approach uses the trajectory HMM to generate visual parameters of the lower face and achieves video-realistic animation of the whole face. Specifically, we model visual speech with an active appearance model (AAM), which offers a convenient and compact statistical description of both the shape and the appearance variations of the face. To achieve high-fidelity video-realistic effects, we use Poisson image editing to stitch the synthesized lower-face image seamlessly onto a whole-face image. Objective and subjective experiments show that the proposed approach produces natural facial animation.
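To make the trajectory-generation step concrete, the sketch below illustrates maximum-likelihood parameter generation (MLPG) for one parameter stream: given per-frame Gaussian means and variances of static-plus-delta features from the aligned HMM states, the smooth static trajectory is the solution of the weighted least-squares system c = (WᵀΣ⁻¹W)⁻¹WᵀΣ⁻¹μ, where W stacks the delta windows. This is a minimal illustration, not the authors' implementation; the delta window coefficients and variable names are assumptions.

```python
import numpy as np

def mlpg_trajectory(means, variances):
    """MLPG for one 1-D stream of static+delta features.

    means, variances: (T, 2) arrays of per-frame Gaussian means and
    variances for [static, delta], from the aligned HMM states.
    Returns the static trajectory c of length T that maximizes the
    likelihood subject to the delta constraint.
    """
    T = means.shape[0]
    # Build W (2T x T): rows interleave the static window [1] and the
    # delta window [-0.5, 0, 0.5], truncated at the sequence edges.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                  # static coefficient
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5     # delta window, left
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5      # delta window, right
    mu = means.reshape(-1)                 # interleaved (2T,) mean vector
    prec = 1.0 / variances.reshape(-1)     # diagonal of Sigma^-1
    # Solve (W' Sigma^-1 W) c = W' Sigma^-1 mu
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# Example: 5 frames of noisy static means with zero-mean delta targets;
# the solved trajectory is a smoothed version of the static means.
rng = np.random.default_rng(0)
means = np.stack([np.linspace(0.0, 1.0, 5) + 0.1 * rng.standard_normal(5),
                  np.zeros(5)], axis=1)
variances = np.full((5, 2), 0.1)
print(mlpg_trajectory(means, variances))
```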
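For the stitching step, OpenCV exposes Pérez et al.'s gradient-domain (Poisson) blending as `cv2.seamlessClone`. The following is a minimal sketch of compositing a synthesized lower-face patch onto a full-face frame; the file paths, mask, and placement coordinates are placeholders, not the authors' data or pipeline.

```python
import cv2
import numpy as np

# Hypothetical inputs (placeholder paths): a synthesized lower-face
# patch and a target full-face frame.
lower_face = cv2.imread("synthesized_lower_face.png")
full_face = cv2.imread("target_frame.png")

# 8-bit mask selecting the region of the patch to blend (here: all of it).
mask = 255 * np.ones(lower_face.shape[:2], dtype=np.uint8)

# Where the patch center should land on the target face, e.g. the
# mouth/chin area (illustrative coordinates).
center = (full_face.shape[1] // 2, int(full_face.shape[0] * 0.7))

# Gradient-domain blending: solves the Poisson equation inside the mask
# with the target frame as Dirichlet boundary, hiding the seam.
blended = cv2.seamlessClone(lower_face, full_face, mask, center,
                            cv2.NORMAL_CLONE)
cv2.imwrite("blended_frame.png", blended)
```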



Notes

  1. http://dict.bing.com.cn

  2. https://itunes.apple.com/us/app/talking-tom-cat/id377194688?mt=8

  3. http://marketplace.xbox.com/en-US/Product/Avatar-Kinect/66acd000-77fe-1000-9115-d8025848081a

  4. Base vector is added in.

  5. http://hts.sp.nitech.ac.jp/

  6. We tested training sets of 20, 40, 100, 150, 200, 250, 300 and 350 sentences, denoted S20, ..., S350.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61175018), the Natural Science Basic Research Plan of Shaanxi Province (2011JM8009) and the Fok Ying Tung Education Foundation (131059).

Author information

Corresponding author: Lei Xie.


About this article

Cite this article

Xie, L., Sun, N. & Fan, B. A statistical parametric approach to video-realistic text-driven talking avatar. Multimed Tools Appl 73, 377–396 (2014). https://doi.org/10.1007/s11042-013-1633-3
