Speech driven realistic mouth animation based on multi-modal unit selection

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. First, an audio-visual diviseme database is built, consisting of the audio feature sequences, intensity sequences, and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted sum of three terms: 1) the logarithm of the concatenation smoothness of the synthesized mouth trajectory; 2) the logarithm of the pronunciation distance; and 3) the logarithm of the audio intensity distance between the candidate diviseme instance and the target diviseme segment in the incoming speech. The selected diviseme instances are time-warped and blended to construct the mouth animation. Objective and subjective evaluations of the synthesized mouth animations show that the proposed multimodal diviseme instance selection algorithm outperforms the triphone unit selection algorithm of Video Rewrite. Clear, accurate, and smooth mouth animations are obtained that match the pronunciation and intensity changes of the incoming speech well. Moreover, the logarithm function in the accumulative cost makes it easy to set weights that yield optimal mouth animations.
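To make the selection step more concrete, the Python sketch below gives one plausible reading of the Viterbi-based diviseme instance selection described above: each candidate instance carries a local cost built from the logarithms of its pronunciation distance and intensity distance to the target diviseme segment, and the transition between consecutive candidates is penalized by the logarithm of the visual-feature discontinuity at the diviseme boundary. The data layout, the field names (audio, intensity, visual_start, visual_end), the Euclidean distances, and the helper select_diviseme_instances are illustrative assumptions, not the paper's exact definitions.

import numpy as np

def select_diviseme_instances(targets, candidates, weights=(1.0, 1.0, 1.0)):
    """Viterbi-style selection of one candidate instance per target diviseme.

    targets    : per-target dicts with hypothetical keys 'audio', 'intensity'
                 (feature vectors of the target diviseme segments).
    candidates : per-target lists of dicts with hypothetical keys 'audio',
                 'intensity', 'visual_start', 'visual_end'.
    weights    : (w_smooth, w_pron, w_int) weights on the three log-cost terms.
    """
    w_smooth, w_pron, w_int = weights
    eps = 1e-8  # guard against log(0)

    # Local (target) cost of each candidate: log pronunciation distance
    # plus log intensity distance to the target segment.
    local = []
    for t, cands in zip(targets, candidates):
        costs = [w_pron * np.log(np.linalg.norm(c['audio'] - t['audio']) + eps)
                 + w_int * np.log(np.linalg.norm(c['intensity'] - t['intensity']) + eps)
                 for c in cands]
        local.append(np.asarray(costs))

    # Forward pass: accumulate the best cost of reaching each candidate.
    acc, back = [local[0]], []
    for i in range(1, len(targets)):
        prev = acc[-1]
        cur = np.empty(len(candidates[i]))
        ptr = np.empty(len(candidates[i]), dtype=int)
        for j, c in enumerate(candidates[i]):
            # Concatenation cost: visual-feature jump at the diviseme boundary.
            jumps = np.array([np.linalg.norm(c['visual_start'] - p['visual_end'])
                              for p in candidates[i - 1]])
            total = prev + w_smooth * np.log(jumps + eps) + local[i][j]
            ptr[j] = int(np.argmin(total))
            cur[j] = total[ptr[j]]
        acc.append(cur)
        back.append(ptr)

    # Backtrace the minimum-cost path of candidate indices.
    path = [int(np.argmin(acc[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

Because each weight scales a logarithmic term, the accumulative cost is the logarithm of a weighted product of the three distances, which is consistent with the abstract's remark that the logarithm makes the weights easy to tune.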


References

  1. McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748

  2. Massaro D (1998) Perceiving talking faces. MIT Press, Cambridge

  3. Theobald BJ, Bangham JA, Matthews IA, Cawley GC (2004) Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Commun. 44:127–140

  4. Wu Z, Zhang S, Cai L, Meng H (2006) Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar. In: Proc of the international conference on spoken language processing (ICSLP), Pittsburgh, USA, Sep 17–21

  5. Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Computer graphics annual conference series (SIGGRAPH), pp 353–360, Los Angeles, California

  6. Cosatto E, Graf H (1998) Sample-based synthesis of photorealistic talking heads. In: Proc. of computer animation, pp 103–110, Philadelphia, Pennsylvania

  7. Ezzat T, Poggio T (2000) Visual speech synthesis by morphing visemes. Int J Comput Vis 38:45–57

  8. Ezzat T, Geiger G, Poggio T (2002) Trainable videorealistic speech animation. In: Proc of the international conference on computer graphics and interactive techniques (SIGGRAPH), pp 388–398, San Antonio, Texas

  9. Huang F, Cosatto E, Graf H (2002) Triphone based unit selection for concatenative visual speech synthesis. In: Proc of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), vol II, pp 2037–2040, Orlando, Florida, USA

  10. Fagel S (2004) Video-realistic synthetic speech with a parametric visual speech synthesizer. In: Proc of the 8th international conference on spoken language processing (INTERSPEECH), pp 2033–2036

  11. Yamamoto E, Nakamura S, Shikano K (1998) Lip movement synthesis from speech based on hidden Markov models. Speech Commun 26(1):105–115

  12. Nakamura S, Yamamoto E, Shikano K (1998) Speech-to-lip movement synthesis by maximizing audio-visual joint probability based on the EM algorithm. In: Proc of the IEEE second workshop on multimedia signal processing (MMSP), pp 53–58

  13. Choi K, Luo Y, Hwang J (2001) Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system. J VLSI Signal Process 29:51–61

  14. Aleksic PS, Katsaggelos AK (2003) Speech-to-video synthesis using facial animation parameters. In: Proc of the 2003 international conference on image processing (ICIP03), vol 2, issue III, pp 1–4

  15. Cosker D, Marshall D, Rosin P, Hicks Y (2004) Speech driven facial animation using a hidden Markov coarticulation model. In: Proc of the 17th international conference on pattern recognition 2004 (ICPR2004), vol 1, pp 128–131

  16. Xie L, Liu Z-Q (2007) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimedia 9(3):500–510

  17. Jiang D, Xie L, Ravyse I, Zhao R, Sahli H, Cornelis J (2002) Triseme decision trees in the continuous speech recognition system for a talking head. In: Proc of the 1st IEEE international conference on machine learning and cybernetics, pp 2097–2100

  18. Verma A, Rajput N, Subramaniam L (2003) Using viseme based acoustic models for speech driven lip synthesis. In: Proc of the IEEE international conference on acoustic speech and signal processing (ICASSP), pp 720–723

  19. Deng Z, Neumann U, Lewis JP et al (2006) Expressive facial animation synthesis by learning speech coarticulation and expression spaces. IEEE Trans Vis Comput Graph 12(6):1–12

  20. Cao Y, Faloutsos P, Kohler E, Pighin F (2004) Real-time speech motion synthesis from recorded motions. In: Proc of the ACM SIGGRAPH/Eurographics symposium on computer animation, pp 347–355

  21. Ma J, Cole R, Pellom B, Ward W, Wise B (2006) Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Trans Vis Comput Graph 12:1–11

  22. Ravyse I, Enescu V, Sahli H (2005) Kernel-based head tracker for videophony. In: Proc of the IEEE international conference on image processing 2005 (ICIP2005), Genoa, Italy, vol 3, pp 1068–1071

  23. Hou Y, Sahli H, Ravyse I, Zhang Y, Zhao R (2007) Robust shape-based head tracking. In: Proc of the advanced concepts for intelligent vision systems. LNCS, vol 4678, pp 340–351

  24. Ma J, Cole R, Pellom B, Ward W, Wise B (2004) Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput Animat Virtual Worlds 15:485–500

  25. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1. Accessed 5 December 2008

  26. Young SJ (1993) The HTK hidden Markov model toolkit: design and philosophy. Technical Report, University of Cambridge, Department of Engineering, Cambridge, UK

  27. Jiang D, Ravyse I, Sahli H, Zhang Y (2008) Accurate visual speech synthesis based on diviseme unit selection and concatenation. In: Proc of the IEEE 10th workshop on multimedia signal processing (MMSP2008), pp 906–909

  28. Schaback R (1995) Computer aided geometric design III. Vanderbilt University Press, Nashville, pp 477–496

  29. Ravyse I (2006) Facial analysis and synthesis. Ph.D. Thesis, Dept. Electronics and Informatics, Vrije Universiteit Brussel, Belgium

  30. Cosatto E, Potamianos G, Graf HP (2000) Audio visual unit selection for the synthesis of photo-realistic talking heads. In: Proc of the IEEE international conference on multimedia and expo (ICME), vol 2, pp 619–622

  31. Lv G, Jiang D, Zhao R, Hou Y (2007) Multi-stream asynchrony modeling for audio-visual speech recognition. In: Proc of the IEEE international symposium on multimedia (ISM), pp 37–44

Author information

Corresponding author

Correspondence to Dongmei Jiang.

About this article

Cite this article

Jiang, D., Ravyse, I., Sahli, H. et al. Speech driven realistic mouth animation based on multi-modal unit selection. J Multimodal User Interfaces 2, 157 (2008). https://doi.org/10.1007/s12193-009-0015-7

