
Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation

  • Conference paper

Part of the book series: Springer Proceedings in Advanced Robotics (SPAR, volume 35)

Included in the following conference series: Human-Friendly Robotics 2024 (HFR 2024)


Abstract

A wide array of new real-world applications using social robots and virtual agents is driving humans towards closer connections with these artificial systems. Consequently, nonverbal human-robot interaction has become a major research focus, aiming for more versatile and natural exchanges. In this work, we use a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation obtained via a Vector Quantized Variational Auto-Encoder (VQ-VAE). This approach addresses the well-known limitation of diffusion models, namely their long training and inference times; as a result, we achieved up to a 5-fold increase in generation speed. In addition, we conducted a subjective evaluation which demonstrated that, despite using discrete gesture representations, the quality of the generated nonverbal behavior is preserved.
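The pipeline described in the abstract, compressing gestures into discrete latent codes with a VQ-VAE and then running diffusion-style denoising in that compact latent space conditioned on speech, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module names (VectorQuantizer, LatentDenoiser), dimensions, and the simplified noising step are hypothetical choices used only to show where the codebook lookup and the latent-space denoising fit together.

```python
# Minimal illustrative sketch (NOT the authors' implementation): shows how a
# VQ-VAE codebook can turn continuous gesture latents into discrete codes, and
# how a denoiser can then operate in that compact latent space conditioned on
# speech features. Module names, sizes, and the simplified noising step are
# hypothetical choices made only for this example.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Maps continuous encoder outputs to their nearest codebook entries."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous latents from a gesture encoder
        flat = z_e.reshape(-1, z_e.size(-1))                 # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)      # (B*T, K) distances
        codes = dists.argmin(dim=-1)                         # index of nearest code
        z_q = self.codebook(codes).view_as(z_e)              # quantized latents
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1])


class LatentDenoiser(nn.Module):
    """Toy stand-in for a diffusion denoiser acting on gesture latents."""

    def __init__(self, code_dim: int = 64, audio_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + audio_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, z_noisy, audio_feat, t):
        # Condition on speech features and the diffusion timestep.
        t_emb = t.float().unsqueeze(-1)                      # (B, T, 1)
        return self.net(torch.cat([z_noisy, audio_feat, t_emb], dim=-1))


if __name__ == "__main__":
    B, T, D, A = 2, 32, 64, 128              # batch, frames, latent dim, audio dim
    vq = VectorQuantizer(code_dim=D)
    denoiser = LatentDenoiser(code_dim=D, audio_dim=A)

    z_e = torch.randn(B, T, D)               # encoder output for a gesture clip
    z_q, codes = vq(z_e)                     # discrete gesture representation
    t = torch.randint(0, 1000, (B, T))       # per-frame diffusion timesteps
    z_noisy = z_q + torch.randn_like(z_q)    # (very) simplified forward process
    eps_pred = denoiser(z_noisy, torch.randn(B, T, A), t)
    print(codes.shape, eps_pred.shape)       # (2, 32) and (2, 32, 64)
```

The design point this sketch illustrates is that each denoising step operates on compact latents rather than raw joint rotations, which is where the reported reduction in generation time would come from.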



Acknowledgment

This work was supported by the Horizon Europe program under Grant Agreement 101070351 (SERMAS).

Author information

Corresponding author

Correspondence to Filippo Favali.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (2025). Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation. In: Paolillo, A., Giusti, A., Abbate, G. (eds) Human-Friendly Robotics 2024. HFR 2024. Springer Proceedings in Advanced Robotics, vol 35. Springer, Cham. https://doi.org/10.1007/978-3-031-81688-8_3

