Speech driven realistic mouth animation based on multi-modal unit selection

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. First, an audio-visual diviseme database is built, consisting of the audio feature sequences, intensity sequences, and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted sum of three terms: 1) the logarithm of the concatenation smoothness of the synthesized mouth trajectory; 2) the logarithm of the pronunciation distance; and 3) the logarithm of the audio intensity distance between the candidate diviseme instance and the target diviseme segment in the incoming speech. The selected diviseme instances are time-warped and blended to construct the mouth animation. Objective and subjective evaluations of the synthesized mouth animations show that the proposed multimodal diviseme instance selection algorithm outperforms the triphone unit selection algorithm of Video Rewrite. Clear, accurate, and smooth mouth animations are obtained that match the pronunciation and intensity changes of the incoming speech well. Moreover, the logarithm function in the accumulative cost makes it easy to set weights that yield optimal mouth animations.
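To make the selection step more concrete, the Python sketch below gives one plausible reading of the Viterbi-based diviseme instance selection described above: each candidate instance carries a local cost built from the logarithms of its pronunciation distance and intensity distance to the target diviseme segment, and the transition between consecutive candidates is penalized by the logarithm of the visual-feature discontinuity at the diviseme boundary. The data layout, the field names (audio, intensity, visual_start, visual_end), the Euclidean distances, and the helper select_diviseme_instances are illustrative assumptions, not the paper's exact definitions.

import numpy as np

def select_diviseme_instances(targets, candidates, weights=(1.0, 1.0, 1.0)):
    """Viterbi-style selection of one candidate instance per target diviseme.

    targets    : per-target dicts with hypothetical keys 'audio', 'intensity'
                 (feature vectors of the target diviseme segments).
    candidates : per-target lists of dicts with hypothetical keys 'audio',
                 'intensity', 'visual_start', 'visual_end'.
    weights    : (w_smooth, w_pron, w_int) weights on the three log-cost terms.
    """
    w_smooth, w_pron, w_int = weights
    eps = 1e-8  # guard against log(0)

    # Local (target) cost of each candidate: log pronunciation distance
    # plus log intensity distance to the target segment.
    local = []
    for t, cands in zip(targets, candidates):
        costs = [w_pron * np.log(np.linalg.norm(c['audio'] - t['audio']) + eps)
                 + w_int * np.log(np.linalg.norm(c['intensity'] - t['intensity']) + eps)
                 for c in cands]
        local.append(np.asarray(costs))

    # Forward pass: accumulate the best cost of reaching each candidate.
    acc, back = [local[0]], []
    for i in range(1, len(targets)):
        prev = acc[-1]
        cur = np.empty(len(candidates[i]))
        ptr = np.empty(len(candidates[i]), dtype=int)
        for j, c in enumerate(candidates[i]):
            # Concatenation cost: visual-feature jump at the diviseme boundary.
            jumps = np.array([np.linalg.norm(c['visual_start'] - p['visual_end'])
                              for p in candidates[i - 1]])
            total = prev + w_smooth * np.log(jumps + eps) + local[i][j]
            ptr[j] = int(np.argmin(total))
            cur[j] = total[ptr[j]]
        acc.append(cur)
        back.append(ptr)

    # Backtrace the minimum-cost path of candidate indices.
    path = [int(np.argmin(acc[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

Because each weight scales a logarithmic term, the accumulative cost is the logarithm of a weighted product of the three distances, which is consistent with the abstract's remark that the logarithm makes the weights easy to tune.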


References

  1. McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748

  2. Massaro D (1998) Perceiving talking faces. MIT Press, Cambridge

  3. Theobald BJ, Bangham JA, Matthews IA, Cawley GC (2004) Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Commun. 44:127–140

  4. Wu Z, Zhang S, Cai L, Meng H (2006) Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar. In: Proc of the international conference on spoken language processing (ICSLP), Pittsburgh, USA, Sep 17–21

  5. Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Computer graphics annual conference series (SIGGRAPH), pp 353–360, Los Angeles, California

  6. Cosatto E, Graf H (1998) Sample-based synthesis of photorealistic talking heads. In: Proc. of computer animation, pp 103–110, Philadelphia, Pennsylvania

  7. Ezzat T, Poggio T (2000) Visual speech synthesis by morphing visemes. Int J Comput Vis 38:45–57

  8. Ezzat T, Geiger G, Poggio T (2002) Trainable videorealistic speech animation. In: Proc of the international conference on computer graphics and interactive techniques (SIGGRAPH), pp 388–398, San Antonio, Texas

  9. Huang F, Cosatto E, Graf H (2002) Triphone based unit selection for concatenative visual speech synthesis. In: Proc of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol II, pp 2037–2040, Orlando, Florida, USA

  10. Fagel S (2004) Video-realistic synthetic speech with a parametric visual speech synthesizer. In: Proc of the 8th international conference on spoken language processing (INTERSPEECH), pp 2033–2036

  11. Yamamoto E, Nakamura S, Shikano K (1998) Lip movement synthesis from speech based on hidden Markov models. Speech Commun 26(1):105–115

  12. Nakamura S, Yamamoto E, Shikano K (1998) Speech-to-lip movement synthesis by maximizing audio-visual joint probability based on the EM algorithm. In: Proc of the IEEE second workshop on multimedia signal processing (MMSP), pp 53–58

  13. Choi K, Luo Y, Hwang J (2001) Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system. J VLSI Signal Process 29:51–61

  14. Aleksic PS, Katsaggelos AK (2003) Speech-to-video synthesis using facial animation parameters. In: Proc of the 2003 international conference on image processing (ICIP03), vol 2, issue III, pp 1–4

  15. Cosker D, Marshall D, Rosin P, Hicks Y (2004) Speech driven facial animation using a hidden Markov coarticulation model. In: Proc of the 17th international conference on pattern recognition 2004 (ICPR2004), vol 1, pp 128–131

  16. Xie L, Liu Z-Q (2007) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimedia 9(3):500–510

  17. Jiang D, Xie L, Ravyse I, Zhao R, Sahli H, Cornelis J (2002) Triseme decision trees in the continuous speech recognition system for a talking head. In: Proc of the 1st IEEE international conference on machine learning and cybernetics, pp 2097–2100

  18. Verma A, Rajput N, Subramaniam L (2003) Using viseme based acoustic models for speech driven lip synthesis. In: Proc of the IEEE international conference on acoustic speech and signal processing (ICASSP), pp 720–723

  19. Deng Z, Neumann U, Lewis JP et al (2006) Expressive facial animation synthesis by learning speech coarticulation and expression spaces. IEEE Trans Vis Comput Graph 12(6):1–12

  20. Cao Y, Faloutsos P, Kohler E, Pighin F (2004) Real-time speech motion synthesis from recorded motions. In: Proc of the ACM SIGGRAPH/Eurographics symposium on computer animation, pp 347–355

  21. Ma J, Cole R, Pellom B, Ward W, Wise B (2006) Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Trans Vis Comput Graph 12:1–11

  22. Ravyse I, Enescu V, Sahli H (2005) Kernel-based head tracker for videophony. In: Proc of the IEEE international conference on image processing 2005 (ICIP2005), Genoa, Italy, vol 3, pp 1068–1071

  23. Hou Y, Sahli H, Ravyse I, Zhang Y, Zhao R (2007) Robust shape-based head tracking. In: Proc of the advanced concepts for intelligent vision systems. LNCS, vol 4678, pp 340–351

  24. Ma J, Cole R, Pellom B, Ward W, Wise B (2004) Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput Animat Virtual Worlds 15:485–500

  25. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1. Accessed 5 December 2008

  26. Young SJ (1993) The HTK hidden Markov model toolkit: design and philosophy. Technical Report, University of Cambridge, Department of Engineering, Cambridge, UK

  27. Jiang D, Ravyse I, Sahli H, Zhang Y (2008) Accurate visual speech synthesis based on diviseme unit selection and concatenation. In: Proc of the IEEE 10th workshop on multimedia signal processing (MMSP2008), pp 906–909

  28. Schaback R (1995) Computer aided geometric design III. Vanderbilt University Press, Nashville, pp 477–496

  29. Ravyse I (2006) Facial analysis and synthesis. Ph.D. Thesis, Dept. Electronics and Informatics, Vrije Universiteit Brussel, Belgium

  30. Cosatto E, Potamianos G, Graf HP (2000) Audio visual unit selection for the synthesis of photo-realistic talking heads. In: Proc of the IEEE international conference on multimedia and expo (ICME), vol 2, pp 619–622

  31. Lv G, Jiang D, Zhao R, Hou Y (2007) Multi-stream asynchrony modeling for audio-visual speech recognition. In: Proc of the IEEE international symposium on multimedia (ISM), pp 37–44

Author information

Corresponding author

Correspondence to Dongmei Jiang.

About this article

Cite this article

Jiang, D., Ravyse, I., Sahli, H. et al. Speech driven realistic mouth animation based on multi-modal unit selection. J Multimodal User Interfaces 2, 157 (2008). https://doi.org/10.1007/s12193-009-0015-7

