Abstract
With the widespread use of diverse computing devices, human-computer interaction design is moving towards user-centric interfaces that incorporate the modalities humans use in everyday communication. Virtual humans that look and behave believably fit perfectly into this concept of designing interfaces in a more natural, effective, and socially oriented way. In this paper we present a novel method for automatic speech-driven facial gesturing for virtual humans, capable of real-time performance. The facial gestures covered include various nods and head movements, blinks, eyebrow gestures, and gaze. The mapping from speech to facial gestures is based on prosodic information obtained from the speech signal and is realized using a hybrid approach combining Hidden Markov Models, rules, and global statistics. We further test the method with an application prototype: a system for speech-driven facial gesturing suitable for virtual presenters. A subjective evaluation of the system confirmed that the synthesized facial movements are consistent and time-aligned with the underlying speech, producing natural behavior of the whole face.
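To make the idea of a prosody-driven rule layer concrete, the sketch below illustrates one plausible shape of such a pipeline: per-frame energy and a crude autocorrelation-based F0 estimate are extracted from the speech signal, and simple rules map prosodic events (a pitch rise, a pause) to gesture candidates. This is a minimal illustration under assumed parameters (frame sizes, thresholds, gesture names are all hypothetical), not the paper's actual HMM/rule/statistics implementation.

```python
import numpy as np

def frame_features(signal, sr=16000, frame=400, hop=160):
    """Per-frame energy and a crude autocorrelation F0 estimate (80-400 Hz)."""
    feats = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame]
        energy = float(np.mean(x ** 2))
        # Autocorrelation; keep non-negative lags only.
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = sr // 400, sr // 80          # lag range for 80-400 Hz
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0 = sr / lag if energy > 1e-4 else 0.0   # 0.0 marks unvoiced/pause
        feats.append((energy, f0))
    return feats

def gestures_from_prosody(feats, rise=1.15):
    """Toy rule layer: eyebrow raise on a pitch rise, blink candidate on a pause."""
    events, prev_f0 = [], 0.0
    for i, (energy, f0) in enumerate(feats):
        if f0 == 0.0 and energy <= 1e-4:
            events.append((i, "blink_candidate"))
        elif prev_f0 > 0 and f0 > rise * prev_f0:
            events.append((i, "eyebrow_raise"))
        if f0 > 0:
            prev_f0 = f0
    return events
```

In the paper's hybrid approach, a rule layer like this would be combined with HMMs trained on gesture timing and with global statistics governing overall gesture frequency; the sketch shows only the rule component.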
Acknowledgments
The work was partly carried out within the research project “Embodied Conversational Agents as interface for networked and mobile services” supported by the Ministry of Science, Education and Sports of the Republic of Croatia. This work was partly supported by grants from The National Foundation for Science, Higher Education and Technological Development of the Republic of Croatia and The Swedish Institute, Sweden.
Cite this article
Zoric, G., Forchheimer, R. & Pandzic, I.S. On creating multimodal virtual humans—real time speech driven facial gesturing. Multimed Tools Appl 54, 165–179 (2011). https://doi.org/10.1007/s11042-010-0526-y