A deep bidirectional LSTM approach for video-realistic talking head

Abstract

This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-range contextual, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded by filming a subject talking to a camera. The audio streams are converted into acoustic features, namely Mel-frequency cepstral coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular the lower face region, are compactly represented by active appearance model (AAM) parameters, which jointly model shape and texture variations. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from the audio to the visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for lower face animation. To further improve the realism of the proposed talking head, the trajectory-tiling method is adopted: the DBLSTM-predicted AAM trajectory serves as a guide for selecting a smooth sequence of real sample images from the recorded database. The selected lower face image sequence is then stitched back onto a background face video of the same subject, yielding a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
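To make the audio-to-visual regression concrete, below is a minimal sketch of such a sequence mapper in PyTorch. This is an illustration only, not the authors' implementation: the feature dimensions (39-dimensional MFCC frames, 40-dimensional AAM parameter vectors), the hidden size, the layer count, and the mean-squared-error training criterion are all assumptions made for the example.

import torch
import torch.nn as nn

class AudioToAAM(nn.Module):
    """Bidirectional LSTM regressor from MFCC frames to AAM parameters
    (hypothetical dimensions; not the paper's exact configuration)."""
    def __init__(self, mfcc_dim=39, aam_dim=40, hidden=256, num_layers=3):
        super().__init__()
        # Deep bidirectional LSTM: stacked layers read the utterance in
        # both directions, so every frame sees past and future context.
        self.blstm = nn.LSTM(mfcc_dim, hidden, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # Linear output layer over the concatenated forward/backward states.
        self.out = nn.Linear(2 * hidden, aam_dim)

    def forward(self, mfcc):           # mfcc: (batch, frames, mfcc_dim)
        h, _ = self.blstm(mfcc)        # h: (batch, frames, 2 * hidden)
        return self.out(h)             # (batch, frames, aam_dim)

# Dummy training step on one random 100-frame utterance pair.
model = AudioToAAM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mfcc = torch.randn(1, 100, 39)         # stand-in acoustic features
aam = torch.randn(1, 100, 40)          # stand-in visual parameters
loss = nn.functional.mse_loss(model(mfcc), aam)
loss.backward()
optimizer.step()

In a real system the hidden size, depth, and any contextual or delta features would be tuned on the recorded audio-visual database; the bidirectional layers are what give the model access to both past and future acoustic context at every video frame, which is the property the abstract attributes to the DBLSTM.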



Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61175018 and 61571363).

Author information


Correspondence to Bo Fan or Lei Xie.


About this article


Cite this article

Fan, B., Xie, L., Yang, S. et al. A deep bidirectional LSTM approach for video-realistic talking head. Multimed Tools Appl 75, 5287–5309 (2016). https://doi.org/10.1007/s11042-015-2944-3


