Abstract
A wide array of new real-world applications involving social robots and virtual agents is driving humans towards closer connections with these artificial systems. Consequently, nonverbal human-robot interaction has become a major research focus, aiming for more versatile and natural exchanges and communication. In this work, we use a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation learned by a Vector Quantized Variational Auto-Encoder (VQ-VAE). Operating in this compact latent space addresses a well-known limitation of diffusion models, their long training and inference times, and yields up to a 5-fold increase in generation speed. In addition, a subjective evaluation demonstrates that, despite the discrete gesture representation, the quality of the generated nonverbal behavior is preserved.
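To make the described pipeline concrete, below is a minimal sketch (in PyTorch) of how a denoising diffusion model can be trained on VQ-VAE gesture latents rather than raw poses. This is not the authors' released code: all module names, dimensions, the simplistic time embedding, and the per-frame audio-conditioning interface are illustrative assumptions.

# Sketch: VQ-VAE latent gesture codes + DDPM-style training in latent space.
# Each diffusion step operates on a short, low-dimensional latent sequence,
# which is the source of the speed-up discussed in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients.
    (A full VQ-VAE also adds codebook/commitment losses, omitted here.)"""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, T, dim)
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(-1)                               # discrete gesture tokens
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                        # straight-through estimator
        return z_q, idx

class LatentDenoiser(nn.Module):
    """Predicts the noise added to latent gesture frames, conditioned on audio."""
    def __init__(self, dim=64, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + audio_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, z_t, t_norm, audio):
        # t_norm: diffusion step scaled to [0, 1), shape (B,)
        t_emb = t_norm.view(-1, 1, 1).expand(-1, z_t.size(1), 1)
        return self.net(torch.cat([z_t, audio, t_emb], dim=-1))

def diffusion_training_step(denoiser, z0, audio, num_steps=1000):
    """One DDPM training step on VQ-VAE latents (linear beta schedule)."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (z0.size(0),))
    a = alpha_bar[t].view(-1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise            # forward (noising) process
    return F.mse_loss(denoiser(z_t, t.float() / num_steps, audio), noise)

# Illustrative usage: 30 latent frames of 64-d codes, 128-d audio features.
z_q, tokens = VectorQuantizer()(torch.randn(8, 30, 64))
z0, audio = torch.randn(8, 30, 64), torch.randn(8, 30, 128)
loss = diffusion_training_step(LatentDenoiser(), z0, audio)
loss.backward()

At sampling time the reverse process runs over the latent sequence and the VQ-VAE decoder maps the result back to poses, so the many denoising iterations never touch the full-dimensional motion representation.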
Acknowledgment
This work was supported by the Horizon Europe program under Grant Agreement 101070351 (SERMAS).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (2025). Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation. In: Paolillo, A., Giusti, A., Abbate, G. (eds) Human-Friendly Robotics 2024. HFR 2024. Springer Proceedings in Advanced Robotics, vol 35. Springer, Cham. https://doi.org/10.1007/978-3-031-81688-8_3
DOI: https://doi.org/10.1007/978-3-031-81688-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-81687-1
Online ISBN: 978-3-031-81688-8
eBook Packages: Intelligent Technologies and Robotics (R0)