
A multimodal approach of generating 3D human-like talking agent

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

This paper introduces a multimodal framework for generating a 3D human-like talking agent that can communicate with users through speech, lip movement, head motion, facial expression, and body animation. In this framework, lip movements are obtained by searching for and matching acoustic features, represented by Mel-frequency cepstral coefficients (MFCCs), in an audio-visual bimodal database. Head motion is synthesized by visual prosody, which maps textual prosodic features to rotational and translational parameters. Facial expression and body animation are generated by transferring motion data to the skeleton. A simplified, high-level Multimodal Marker Language (MML), in which only a few fields are needed to coordinate the agent's channels, is introduced to drive the agent. Experiments validate the effectiveness of the proposed multimodal framework.
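To make the lip-movement step concrete, the sketch below illustrates the general idea of frame-level MFCC matching against an audio-visual bimodal database. It is a minimal illustration, not the authors' implementation: the database files (`db_mfcc.npy`, `db_lip.npy`), the Euclidean nearest-neighbour search, and the librosa-based feature extraction are assumptions introduced here; the paper's actual search, matching, and smoothing strategy may differ.

```python
# Minimal sketch (not the authors' code) of MFCC-based lip-movement selection:
# represent input speech by per-frame MFCC vectors and, for each frame,
# retrieve the lip parameters of the closest MFCC frame in a pre-recorded
# audio-visual database.
#
# Assumed (illustrative) data layout: db_mfcc is an (N, n_mfcc) array of
# database MFCC frames, db_lip an aligned (N, n_lip_params) array of
# lip-shape parameters captured at the same frame rate.

import numpy as np
import librosa


def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """Compute per-frame MFCC vectors for an input utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.T  # shape (T, n_mfcc): one MFCC vector per frame


def select_lip_frames(query_mfcc, db_mfcc, db_lip):
    """For each query frame, return the lip parameters of the nearest
    database frame (Euclidean distance in MFCC space)."""
    selected = []
    for q in query_mfcc:
        d = np.linalg.norm(db_mfcc - q, axis=1)  # distance to every DB frame
        selected.append(db_lip[np.argmin(d)])
    return np.asarray(selected)  # (T, n_lip_params): drives the mouth model


if __name__ == "__main__":
    db_mfcc = np.load("db_mfcc.npy")   # hypothetical database files
    db_lip = np.load("db_lip.npy")
    query = mfcc_frames("input_speech.wav")
    lip_track = select_lip_frames(query, db_mfcc, db_lip)
    print(lip_track.shape)
```

In practice, consecutive selections would be smoothed, or chosen jointly with a concatenation cost, to avoid jitter in the resulting lip trajectory.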

Author information

Correspondence to Minghao Yang.

Additional information

This work is supported in part by the National Natural Science Foundation of China (No. 60873160 and No. 90820303) and the China-Singapore Institute of Digital Media (CSIDM).

Electronic Supplementary Material

Below are the links to the electronic supplementary material.

  • Supplementary material 1 (AVI 1.50 MB)
  • Supplementary material 2 (AVI 667 kB)
  • Supplementary material 3 (AVI 1.63 MB)
  • Supplementary material 4 (AVI 2.74 MB)


About this article

Cite this article

Yang, M., Tao, J., Mu, K. et al. A multimodal approach of generating 3D human-like talking agent. J Multimodal User Interfaces 5, 61–68 (2012). https://doi.org/10.1007/s12193-011-0073-5
