
A multimodal approach of generating 3D human-like talking agent

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

This paper introduces a multimodal framework for generating a 3D human-like talking agent that can communicate with users through speech, lip movement, head motion, facial expression, and body animation. In this framework, lip movements are obtained by searching for and matching acoustic features, represented by Mel-frequency cepstral coefficients (MFCCs), in an audio-visual bimodal database. Head motion is synthesized by visual prosody, which maps textual prosodic features to rotational and translational parameters. Facial expression and body animation are generated by transferring motion data to the skeleton. A simplified, high-level Multimodal Marker Language (MML), in which only a few fields are needed to coordinate the agent's channels, is introduced to drive the agent. Experiments validate the effectiveness of the proposed multimodal framework.
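To make the lip-movement step concrete, the sketch below illustrates the general idea of frame-level MFCC matching against an audio-visual bimodal database. It is a minimal illustration, not the authors' implementation: the database files (`db_mfcc.npy`, `db_lip.npy`), the Euclidean nearest-neighbour search, and the librosa-based feature extraction are assumptions introduced here; the paper's actual search, matching, and smoothing strategy may differ.

```python
# Minimal sketch (not the authors' code) of MFCC-based lip-movement selection:
# represent input speech by per-frame MFCC vectors and, for each frame,
# retrieve the lip parameters of the closest MFCC frame in a pre-recorded
# audio-visual database.
#
# Assumed (illustrative) data layout: db_mfcc is an (N, n_mfcc) array of
# database MFCC frames, db_lip an aligned (N, n_lip_params) array of
# lip-shape parameters captured at the same frame rate.

import numpy as np
import librosa


def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """Compute per-frame MFCC vectors for an input utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.T  # shape (T, n_mfcc): one MFCC vector per frame


def select_lip_frames(query_mfcc, db_mfcc, db_lip):
    """For each query frame, return the lip parameters of the nearest
    database frame (Euclidean distance in MFCC space)."""
    selected = []
    for q in query_mfcc:
        d = np.linalg.norm(db_mfcc - q, axis=1)  # distance to every DB frame
        selected.append(db_lip[np.argmin(d)])
    return np.asarray(selected)  # (T, n_lip_params): drives the mouth model


if __name__ == "__main__":
    db_mfcc = np.load("db_mfcc.npy")   # hypothetical database files
    db_lip = np.load("db_lip.npy")
    query = mfcc_frames("input_speech.wav")
    lip_track = select_lip_frames(query, db_mfcc, db_lip)
    print(lip_track.shape)
```

In practice, consecutive selections would be smoothed, or chosen jointly with a concatenation cost, to avoid jitter in the resulting lip trajectory.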

Author information

Correspondence to Minghao Yang.

Additional information

This work is supported in part by the National Natural Science Foundation of China (No. 60873160 and No. 90820303) and the China-Singapore Institute of Digital Media (CSIDM).

Electronic Supplementary Material

Below are the links to the electronic supplementary material.

  • Supplementary material 1 (AVI 1.50 MB)
  • Supplementary material 2 (AVI 667 kB)
  • Supplementary material 3 (AVI 1.63 MB)
  • Supplementary material 4 (AVI 2.74 MB)


About this article

Cite this article

Yang, M., Tao, J., Mu, K. et al. A multimodal approach of generating 3D human-like talking agent. J Multimodal User Interfaces 5, 61–68 (2012). https://doi.org/10.1007/s12193-011-0073-5
