Cospeech body motion generation using a transformer


Abstract

Body language is a means of communicating across languages and cultures. Using body motions effectively during speech can enhance persuasiveness, strengthen personal charisma, and make a speech more compelling. Generating body motions that match speech content for digital avatars and social robots has therefore become an important research topic. In this paper, we propose a transformer-based network model that generates body motions from input speech. Our model comprises an audio transformer encoder, a motion transformer encoder, a template variational autoencoder, a cross-modal transformer encoder, and a motion decoder. In addition, we propose a novel evaluation metric that describes motion change trends in terms of distance. Experimental results show that the proposed model produces higher-quality motions than state-of-the-art models: the visualized skeleton motions are more natural and realistic than those of other methods, and the generated motions achieve superior scores on multiple evaluation metrics.
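To make the pipeline described above concrete, the sketch below shows one plausible way the listed modules could be composed. It is a minimal illustration only: the module sizes, the additive fusion of the audio features with the VAE template code, and all names are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the described pipeline: audio encoder + seed-motion
# encoder -> template VAE latent -> cross-modal fusion -> per-frame motion decoder.
# All dimensions, names, and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

def transformer_encoder(d_model=256, layers=2, heads=4):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class CoSpeechMotionSketch(nn.Module):
    def __init__(self, audio_dim=80, pose_dim=2 * 18, d_model=256, latent_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)      # audio frame features -> d_model
        self.audio_encoder = transformer_encoder(d_model)
        self.motion_proj = nn.Linear(pose_dim, d_model)      # 2D joint coordinates -> d_model
        self.motion_encoder = transformer_encoder(d_model)
        # Template VAE head: encodes a seed motion into a latent "template" code.
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.template_proj = nn.Linear(latent_dim, d_model)
        self.cross_modal_encoder = transformer_encoder(d_model)
        self.motion_decoder = nn.Linear(d_model, pose_dim)   # per-frame pose regression

    def forward(self, audio_feats, seed_motion):
        a = self.audio_encoder(self.audio_proj(audio_feats))            # (B, T, d_model)
        m = self.motion_encoder(self.motion_proj(seed_motion)).mean(1)  # (B, d_model)
        mu, logvar = self.to_mu(m), self.to_logvar(m)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)         # reparameterisation
        template = self.template_proj(z).unsqueeze(1).expand_as(a)
        fused = self.cross_modal_encoder(a + template)                  # simple additive fusion
        return self.motion_decoder(fused), mu, logvar                   # (B, T, pose_dim)

model = CoSpeechMotionSketch()
poses, mu, logvar = model(torch.randn(2, 64, 80), torch.randn(2, 8, 36))
print(poses.shape)  # torch.Size([2, 64, 36])
```

In such a setup the VAE template supplies a global motion-style code while the cross-modal encoder aligns it frame by frame with the audio; the actual fusion mechanism and loss terms used in the paper may differ.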


Data Availability

The datasets generated and analyzed in this study are available from the corresponding author on reasonable request.


Acknowledgements

This work was jointly supported by the Basic and Applied Basic Research Program of Guangdong Province (No. 2020A1515110523), the Fundamental Research Funds for the Central Universities (No. QTZX22079), and the Guangxi Key Laboratory of Trusted Software (No. KX202045).

Author information

Contributions

All authors contributed to the study conception and design. Network design, data collection, and analysis were performed by Zixiang Lu, Jiale Hong, and Zhiting He. The first draft of the manuscript was written by Zixiang Lu and Jiale Hong, and all authors commented on and edited subsequent versions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zixiang Lu.

Ethics declarations

Conflict of interest

The authors certify that they have no conflict of interest with any individual or organization regarding this paper.

Ethical and informed consent for data used

The authors have cited all publicly available data on which the conclusions of the paper rely.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lu, Z., He, Z., Hong, J. et al. Cospeech body motion generation using a transformer. Appl Intell 54, 11525–11535 (2024). https://doi.org/10.1007/s10489-024-05769-4
