Abstract
This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-range contextual, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio streams are converted into acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular the lower face region, are compactly represented by active appearance model (AAM) parameters, with which shape and texture variations are jointly modeled. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from the audio to the visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model predicts a convincing AAM parameter trajectory for lower-face animation. To further improve the realism of the proposed talking head, a trajectory tiling method is adopted, using the DBLSTM-predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. We then stitch the selected lower-face image sequence back onto a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
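To make the audio-to-visual mapping concrete, the sketch below shows a minimal DBLSTM regressor in PyTorch that maps frame-level MFCC sequences to AAM parameter trajectories under a mean-squared-error criterion. The feature dimensions, hidden size, layer count, and training details here are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DBLSTM(nn.Module):
    """Deep bidirectional LSTM regressor from acoustic to visual parameters.

    Minimal sketch: dimensions and depth are assumed, not taken from the paper.
    """
    def __init__(self, n_mfcc=39, n_aam=40, hidden=256, layers=2):
        super().__init__()
        # Stacked bidirectional LSTM over the MFCC frame sequence.
        self.blstm = nn.LSTM(n_mfcc, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        # Project the concatenated forward/backward states to AAM parameters.
        self.proj = nn.Linear(2 * hidden, n_aam)

    def forward(self, mfcc):            # mfcc: (batch, frames, n_mfcc)
        h, _ = self.blstm(mfcc)         # h: (batch, frames, 2 * hidden)
        return self.proj(h)             # (batch, frames, n_aam)

model = DBLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# One training step on a toy batch of frame-aligned audio-visual sequences.
mfcc = torch.randn(8, 200, 39)   # 8 utterances, 200 frames of MFCCs each
aam = torch.randn(8, 200, 40)    # corresponding AAM parameter targets
optimizer.zero_grad()
loss = criterion(model(mfcc), aam)
loss.backward()
optimizer.step()
```

At synthesis time, the predicted trajectory serves only as a guide: trajectory tiling then selects the closest smooth sequence of real lower-face images from the recorded database, which is what yields the video-realistic result.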
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61175018 and 61571363).
Cite this article
Fan, B., Xie, L., Yang, S. et al. A deep bidirectional LSTM approach for video-realistic talking head. Multimed Tools Appl 75, 5287–5309 (2016). https://doi.org/10.1007/s11042-015-2944-3