Abstract
Body language is a means of communication that crosses languages and cultures. Using body motions well during speech can enhance persuasiveness, strengthen personal charisma, and make the delivery more effective. Generating body motions that match spoken content for digital avatars and social robots has therefore become an important research topic. In this paper, we propose a transformer-based network that generates body motions from input speech. The model comprises an audio transformer encoder, a motion transformer encoder, a template variational autoencoder, a cross-modal transformer encoder, and a motion decoder. We also propose a novel distance-based evaluation metric that describes motion change trends. Experimental results show that the proposed model produces higher-quality motions than state-of-the-art models: the visualized skeleton motions are more natural and realistic than those of other methods, and the generated motions also score better on multiple evaluation metrics.
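To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of such an architecture, assuming MFCC-like audio features and 2D joint sequences as inputs. All module names, dimensions, the pooling of the seed motion into a template latent, and the additive fusion of audio and template features are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a speech-to-motion transformer pipeline as outlined
# in the abstract: audio encoder, motion encoder, template VAE, cross-modal
# encoder, and motion decoder. Dimensions and fusion scheme are assumptions.
import torch
import torch.nn as nn

D = 256  # assumed shared feature dimension


def encoder(layers: int = 2) -> nn.TransformerEncoder:
    # Standard transformer encoder stack operating on [batch, time, D] tensors.
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class CoSpeechMotionModel(nn.Module):
    def __init__(self, audio_dim: int = 13, pose_dim: int = 2 * 49):
        # audio_dim: e.g. MFCC coefficients; pose_dim: e.g. 49 joints in 2D.
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, D)
        self.pose_proj = nn.Linear(pose_dim, D)
        self.audio_encoder = encoder()    # audio transformer encoder
        self.motion_encoder = encoder()   # motion transformer encoder
        # Template VAE: map a pooled seed-motion embedding to a latent code.
        self.vae_mu = nn.Linear(D, D)
        self.vae_logvar = nn.Linear(D, D)
        self.cross_encoder = encoder()    # cross-modal transformer encoder
        self.motion_decoder = nn.Linear(D, pose_dim)

    def forward(self, audio_feat: torch.Tensor, seed_pose: torch.Tensor):
        a = self.audio_encoder(self.audio_proj(audio_feat))   # [B, T, D]
        m = self.motion_encoder(self.pose_proj(seed_pose))    # [B, S, D]
        # Reparameterized template latent from the pooled seed motion.
        h = m.mean(dim=1)
        mu, logvar = self.vae_mu(h), self.vae_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Fuse audio tokens with the template code, then decode per-frame poses.
        fused = self.cross_encoder(a + z.unsqueeze(1))
        return self.motion_decoder(fused), mu, logvar


if __name__ == "__main__":
    model = CoSpeechMotionModel()
    poses, mu, logvar = model(torch.randn(2, 64, 13), torch.randn(2, 8, 98))
    print(poses.shape)  # torch.Size([2, 64, 98]): one pose per audio frame
```

In this sketch the template VAE simply reparameterizes a pooled motion embedding and injects it additively into the cross-modal encoder; other fusion strategies (e.g. concatenation or cross-attention) would fit the same outline.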






Data Availability
The datasets generated and analyzed in this study are available from the corresponding author on reasonable request.
Acknowledgements
This work was jointly supported by the Basic and Applied Basic Research Program of Guangdong Province (No. 2020A1515110523), the Fundamental Research Funds for the Central Universities (No. QTZX22079), and the Guangxi Key Laboratory of Trusted Software (No. KX202045).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Network design, data collection and analysis were performed by Zixiang Lu, Jiale Hong, and Zhiting He. The first draft of the manuscript was written by Zixiang Lu and Jiale Hong, and all authors commented on and edited subsequent versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest with any individual or organization regarding this paper.
Ethical and informed consent for data used
The authors have cited all publicly available data on which the conclusions of the paper rely.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, Z., He, Z., Hong, J. et al. Cospeech body motion generation using a transformer. Appl Intell 54, 11525–11535 (2024). https://doi.org/10.1007/s10489-024-05769-4