Abstract
This paper proposes a statistical parametric approach to a video-realistic, text-driven talking avatar. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing simple geometric mouth shapes or video-realistic lip motion. Our approach uses the trajectory HMM to generate visual parameters for the lower face, realizing video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To achieve high-fidelity video-realistic effects, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly onto a whole-face image. Objective and subjective experiments show that the proposed approach produces natural facial animation.
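The core of the trajectory HMM approach mentioned above is maximum-likelihood parameter generation: given per-frame Gaussian statistics for static and delta (velocity) visual parameters, the smooth static trajectory c is obtained by solving the normal equations (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ, where W is the window matrix relating statics to deltas. The following is a minimal one-dimensional sketch of that computation, not the authors' code; the function name `ml_trajectory` and the simple boundary handling for the delta window are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# maximum-likelihood trajectory generation from per-frame HMM statistics.
# Inputs are length-T lists of static means/variances and delta
# means/variances; output is the length-T static trajectory that solves
# (W' S^-1 W) c = W' S^-1 mu, with S the diagonal covariance.

def ml_trajectory(static_mu, static_var, delta_mu, delta_var):
    T = len(static_mu)
    # Build the 2T x T window matrix W: even rows select the static
    # parameter; odd rows compute the delta (c[t+1] - c[t-1]) / 2, with
    # simple forward/backward differences at the two boundaries.
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        if t == 0:
            W[1][0], W[1][1] = -1.0, 1.0
        elif t == T - 1:
            W[2 * t + 1][t - 1], W[2 * t + 1][t] = -1.0, 1.0
        else:
            W[2 * t + 1][t - 1], W[2 * t + 1][t + 1] = -0.5, 0.5
    # Interleave means and precisions (inverse variances) in row order.
    mu = [v for pair in zip(static_mu, delta_mu) for v in pair]
    prec = [1.0 / v for pair in zip(static_var, delta_var) for v in pair]
    # Normal equations: A = W' S^-1 W, b = W' S^-1 mu.
    A = [[sum(prec[k] * W[k][i] * W[k][j] for k in range(2 * T))
          for j in range(T)] for i in range(T)]
    b = [sum(prec[k] * W[k][i] * mu[k] for k in range(2 * T))
         for i in range(T)]
    # Solve A c = b by Gaussian elimination with partial pivoting.
    for i in range(T):
        p = max(range(i, T), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, T):
            f = A[r][i] / A[i][i]
            for col in range(i, T):
                A[r][col] -= f * A[i][col]
            b[r] -= f * b[i]
    c = [0.0] * T
    for i in range(T - 1, -1, -1):
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, T))) / A[i][i]
    return c
```

With constant static means and zero delta means, the generated trajectory is flat, since a constant curve matches both the statics and the zero velocities exactly; in general the delta constraints are what smooth the frame-by-frame means into a continuous trajectory. Practical implementations exploit the banded structure of WᵀΣ⁻¹W instead of dense elimination.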
Notes
The base vector is included.
We tested training sets of 20, 40, 100, 150, 200, 250, 300 and 350 sentences, denoted S20,..., S350.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (61175018), the Natural Science Basic Research Plan of Shaanxi Province (2011JM8009) and the Fok Ying Tung Education Foundation (131059).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Citation
Xie, L., Sun, N. & Fan, B. A statistical parametric approach to video-realistic text-driven talking avatar. Multimed Tools Appl 73, 377–396 (2014). https://doi.org/10.1007/s11042-013-1633-3