Cospeech body motion generation using a transformer


Abstract

Body language is a means of communicating across languages and cultures. Using body motions effectively during speech can enhance persuasiveness, strengthen personal charisma, and make a speech more compelling. Generating body motions that match speech content for digital avatars and social robots has therefore become an important research topic. In this paper, we propose a transformer-based network model that generates body motions from input speech. Our model comprises an audio transformer encoder, a motion transformer encoder, a template variational autoencoder, a cross-modal transformer encoder, and a motion decoder. In addition, we propose a novel evaluation metric that describes motion change trends in terms of distance. Experimental results show that the proposed model produces higher-quality motions than state-of-the-art models: the visualized skeleton motions are more natural and realistic than those of other methods, and the generated motions achieve superior scores on multiple evaluation metrics.
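To make the pipeline described above concrete, the sketch below shows one plausible way the listed modules could be composed. It is a minimal illustration only: the module sizes, the additive fusion of the audio features with the VAE template code, and all names are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the described pipeline: audio encoder + seed-motion
# encoder -> template VAE latent -> cross-modal fusion -> per-frame motion decoder.
# All dimensions, names, and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

def transformer_encoder(d_model=256, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class CoSpeechMotionSketch(nn.Module):
    def __init__(self, audio_dim=80, pose_dim=2 * 18, d_model=256, latent_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)      # audio frame features -> d_model
        self.audio_encoder = transformer_encoder(d_model)
        self.motion_proj = nn.Linear(pose_dim, d_model)      # 2D joint coordinates -> d_model
        self.motion_encoder = transformer_encoder(d_model)
        # Template VAE head: encodes a seed motion into a latent "template" code.
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.template_proj = nn.Linear(latent_dim, d_model)
        self.cross_modal_encoder = transformer_encoder(d_model)
        self.motion_decoder = nn.Linear(d_model, pose_dim)   # per-frame pose regression

    def forward(self, audio_feats, seed_motion):
        a = self.audio_encoder(self.audio_proj(audio_feats))            # (B, T, d_model)
        m = self.motion_encoder(self.motion_proj(seed_motion)).mean(1)  # (B, d_model)
        mu, logvar = self.to_mu(m), self.to_logvar(m)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)         # reparameterisation
        template = self.template_proj(z).unsqueeze(1).expand_as(a)
        fused = self.cross_modal_encoder(a + template)                  # simple additive fusion
        return self.motion_decoder(fused), mu, logvar                   # (B, T, pose_dim)

model = CoSpeechMotionSketch()
poses, mu, logvar = model(torch.randn(2, 64, 80), torch.randn(2, 8, 36))
print(poses.shape)  # torch.Size([2, 64, 36])
```

In such a setup the VAE template supplies a global motion-style code while the cross-modal encoder aligns it frame by frame with the audio; the actual fusion mechanism and loss terms used in the paper may differ.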


Data Availability

The datasets generated and analyzed in this study are available from the corresponding author on reasonable request.


Acknowledgements

This work was jointly supported by the Basic and Applied Basic Research Program of Guangdong Province (No. 2020A1515110523), the Fundamental Research Funds for the Central Universities (No. QTZX22079), and the Guangxi Key Laboratory of Trusted Software (No. KX202045).

Author information

Contributions

All authors contributed to the study conception and design. Network design, data collection, and analysis were performed by Zixiang Lu, Jiale Hong, and Zhiting He. The first draft of the manuscript was written by Zixiang Lu and Jiale Hong, and all authors commented on and edited subsequent versions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zixiang Lu.

Ethics declarations

Conflict of interest

The authors certify that they have no conflict of interest with any individual or organization regarding this paper.

Ethical and informed consent for data used

The authors have cited all publicly available data on which the conclusions of the paper rely.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lu, Z., He, Z., Hong, J. et al. Cospeech body motion generation using a transformer. Appl Intell 54, 11525–11535 (2024). https://doi.org/10.1007/s10489-024-05769-4
