Abstract
With the widespread use of diverse computing devices, human-computer interaction design is moving towards user-centric interfaces that incorporate the modalities humans use in everyday communication. Virtual humans that look and behave believably fit perfectly into this concept of designing interfaces in a more natural, effective, and socially oriented way. In this paper we present a novel method for automatic speech-driven facial gesturing for virtual humans, capable of real-time performance. The facial gestures covered include various nods and head movements, blinks, eyebrow gestures, and gaze. The mapping from speech to facial gestures is based on prosodic information obtained from the speech signal and is realized using a hybrid approach combining Hidden Markov Models, rules, and global statistics. We further test the method with an application prototype: a system for speech-driven facial gesturing suitable for virtual presenters. A subjective evaluation of the system confirmed that the synthesized facial movements are consistent and time-aligned with the underlying speech, producing natural behavior of the whole face.
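To make the idea of a prosody-driven rule layer concrete, the sketch below illustrates one plausible shape of such a pipeline: per-frame energy and a crude autocorrelation-based F0 estimate are extracted from the speech signal, and simple rules map prosodic events (a pitch rise, a pause) to gesture candidates. This is a minimal illustration under assumed parameters (frame sizes, thresholds, gesture names are all hypothetical), not the paper's actual HMM/rule/statistics implementation.

```python
import numpy as np

def frame_features(signal, sr=16000, frame=400, hop=160):
    """Per-frame energy and a crude autocorrelation F0 estimate (80-400 Hz)."""
    feats = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame]
        energy = float(np.mean(x ** 2))
        # Autocorrelation; keep non-negative lags only.
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = sr // 400, sr // 80          # lag range for 80-400 Hz
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0 = sr / lag if energy > 1e-4 else 0.0   # 0.0 marks unvoiced/pause
        feats.append((energy, f0))
    return feats

def gestures_from_prosody(feats, rise=1.15):
    """Toy rule layer: eyebrow raise on a pitch rise, blink candidate on a pause."""
    events, prev_f0 = [], 0.0
    for i, (energy, f0) in enumerate(feats):
        if f0 == 0.0 and energy <= 1e-4:
            events.append((i, "blink_candidate"))
        elif prev_f0 > 0 and f0 > rise * prev_f0:
            events.append((i, "eyebrow_raise"))
        if f0 > 0:
            prev_f0 = f0
    return events
```

In the paper's hybrid approach, a rule layer like this would be combined with HMMs trained on gesture timing and with global statistics governing overall gesture frequency; the sketch shows only the rule component.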
Acknowledgments
The work was partly carried out within the research project “Embodied Conversational Agents as interface for networked and mobile services” supported by the Ministry of Science, Education and Sports of the Republic of Croatia. This work was partly supported by grants from The National Foundation for Science, Higher Education and Technological Development of the Republic of Croatia and The Swedish Institute, Sweden.
Cite this article
Zoric, G., Forchheimer, R. & Pandzic, I.S. On creating multimodal virtual humans—real time speech driven facial gesturing. Multimed Tools Appl 54, 165–179 (2011). https://doi.org/10.1007/s11042-010-0526-y