
On creating multimodal virtual humans—real time speech driven facial gesturing

Multimedia Tools and Applications

Abstract

With the widespread use of diverse computing devices, human-computer interaction design is moving towards user-centric interfaces, which implies incorporating the different modalities that humans use in everyday communication. Virtual humans who look and behave believably fit naturally into this concept of designing interfaces in a more natural, effective and socially oriented way. In this paper we present a novel method for automatic speech-driven facial gesturing for virtual humans, capable of real-time performance. The facial gestures covered are various nods and head movements, blinks, eyebrow gestures and gaze. The mapping from speech to facial gestures is based on prosodic information obtained from the speech signal and is realized with a hybrid approach combining Hidden Markov Models, rules and global statistics. We further test the method with an application prototype: a system for speech-driven facial gesturing suitable for virtual presenters. A subjective evaluation of the system confirmed that the synthesized facial movements are consistent and time-aligned with the underlying speech, and thus provide natural behavior of the whole face.
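
To make the hybrid mapping more concrete, the sketch below shows one plausible way such a pipeline could be wired together: per-frame prosodic symbols are decoded into gesture states with a small Hidden Markov Model (Viterbi decoding), after which a simple rule and a global statistic (an average blink interval) adjust the output. The gesture labels, prosodic symbol set, model probabilities and post-processing parameters are all illustrative assumptions made for this sketch and are not taken from the paper.

import numpy as np

# Assumed gesture states and quantized prosodic symbols (illustrative only).
GESTURE_STATES = ["idle", "nod", "eyebrow_raise"]
PROSODY_SYMBOLS = ["low", "rising", "high", "pause"]

# Assumed HMM parameters in the log domain: start, transition and
# emission probabilities (rows sum to one before taking the log).
log_start = np.log(np.array([0.8, 0.1, 0.1]))
log_trans = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.1, 0.4],
]))
log_emit = np.log(np.array([
    # low   rising  high   pause
    [0.40, 0.20, 0.10, 0.30],   # idle
    [0.10, 0.40, 0.30, 0.20],   # nod
    [0.10, 0.30, 0.50, 0.10],   # eyebrow_raise
]))

def viterbi(symbols):
    """Most likely gesture-state sequence for a sequence of discrete
    prosodic symbol indices (standard Viterbi decoding)."""
    T, S = len(symbols), len(GESTURE_STATES)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_start + log_emit[:, symbols[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_emit[s, symbols[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def apply_rules_and_statistics(states, frame_ms=40, mean_blink_gap_ms=4000):
    """Post-process the decoded states: a rule suppresses a gesture that
    repeats on consecutive frames, and a global statistic (assumed mean
    blink interval) inserts blinks independently of prosody."""
    gestures, prev = [], None
    for i, s in enumerate(states):
        label = GESTURE_STATES[s]
        if label != "idle" and label == prev:
            label = "idle"                      # rule: no immediate repetition
        if (i * frame_ms) % mean_blink_gap_ms < frame_ms:
            gestures.append((i, "blink"))       # statistics-driven blink
        if label != "idle":
            gestures.append((i, label))
        prev = GESTURE_STATES[s]
    return gestures

if __name__ == "__main__":
    # Toy prosodic symbol stream (indices into PROSODY_SYMBOLS).
    symbols = [0, 1, 2, 2, 3, 0, 1, 2, 0, 3]
    print(apply_rules_and_statistics(viterbi(symbols)))

In a real-time system the discrete prosodic symbols would come from per-frame F0 and energy analysis of the incoming speech, and the resulting gesture events would drive the virtual character's facial animation; here a toy symbol stream stands in for both ends of the pipeline.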



Acknowledgments

The work was partly carried out within the research project “Embodied Conversational Agents as interface for networked and mobile services” supported by the Ministry of Science, Education and Sports of the Republic of Croatia. This work was partly supported by grants from The National Foundation for Science, Higher Education and Technological Development of the Republic of Croatia and The Swedish Institute, Sweden.

Author information


Corresponding author

Correspondence to Goranka Zoric.


About this article

Cite this article

Zoric, G., Forchheimer, R. & Pandzic, I.S. On creating multimodal virtual humans—real time speech driven facial gesturing. Multimed Tools Appl 54, 165–179 (2011). https://doi.org/10.1007/s11042-010-0526-y

