
Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation

  • Conference paper

Part of the book series: Springer Proceedings in Advanced Robotics (SPAR, volume 35)

Included in the following conference series: Human-Friendly Robotics 2024 (HFR 2024)


Abstract

A wide array of new real-world applications using social robots and virtual agents is driving humans towards closer connections with these artificial systems. Consequently, nonverbal human-robot interaction has become a major research focus, aiming for more versatile and natural exchanges. In this work, we use a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation obtained via a Vector Quantized Variational Auto-Encoder (VQ-VAE). This approach addresses the well-known limitation of diffusion models, namely their long training and inference times; as a result, we achieved up to a 5-fold increase in generation speed. In addition, we conducted a subjective evaluation which demonstrated that, despite using discrete gesture representations, the quality of the generated nonverbal behavior is preserved.
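The pipeline described in the abstract, compressing gestures into discrete latent codes with a VQ-VAE and then running diffusion-style denoising in that compact latent space conditioned on speech, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module names (VectorQuantizer, LatentDenoiser), dimensions, and the simplified noising step are hypothetical choices used only to show where the codebook lookup and the latent-space denoising fit together.

```python
# Minimal illustrative sketch (NOT the authors' implementation): shows how a
# VQ-VAE codebook can turn continuous gesture latents into discrete codes, and
# how a denoiser can then operate in that compact latent space conditioned on
# speech features. Module names, sizes, and the simplified noising step are
# hypothetical choices made only for this example.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Maps continuous encoder outputs to their nearest codebook entries."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous latents from a gesture encoder
        flat = z_e.reshape(-1, z_e.size(-1))                 # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)      # (B*T, K) distances
        codes = dists.argmin(dim=-1)                         # index of nearest code
        z_q = self.codebook(codes).view_as(z_e)              # quantized latents
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1])


class LatentDenoiser(nn.Module):
    """Toy stand-in for a diffusion denoiser acting on gesture latents."""

    def __init__(self, code_dim: int = 64, audio_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + audio_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, z_noisy, audio_feat, t):
        # Condition on speech features and the diffusion timestep.
        t_emb = t.float().unsqueeze(-1)                      # (B, T, 1)
        return self.net(torch.cat([z_noisy, audio_feat, t_emb], dim=-1))


if __name__ == "__main__":
    B, T, D, A = 2, 32, 64, 128              # batch, frames, latent dim, audio dim
    vq = VectorQuantizer(code_dim=D)
    denoiser = LatentDenoiser(code_dim=D, audio_dim=A)

    z_e = torch.randn(B, T, D)               # encoder output for a gesture clip
    z_q, codes = vq(z_e)                     # discrete gesture representation
    t = torch.randint(0, 1000, (B, T))       # per-frame diffusion timesteps
    z_noisy = z_q + torch.randn_like(z_q)    # (very) simplified forward process
    eps_pred = denoiser(z_noisy, torch.randn(B, T, A), t)
    print(codes.shape, eps_pred.shape)       # (2, 32) and (2, 32, 64)
```

The design point this sketch illustrates is that each denoising step operates on compact latents rather than raw joint rotations, which is where the reported reduction in generation time would come from.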



Acknowledgment

This work was supported by the Horizon Europe program under Grant Agreement 101070351 (SERMAS).

Author information

Corresponding author

Correspondence to Filippo Favali.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (2025). Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation. In: Paolillo, A., Giusti, A., Abbate, G. (eds) Human-Friendly Robotics 2024. HFR 2024. Springer Proceedings in Advanced Robotics, vol 35. Springer, Cham. https://doi.org/10.1007/978-3-031-81688-8_3

