
GERT: Transformers for Co-speech Gesture Prediction in Social Robots

  • Conference paper
Social Robotics (ICSR 2023)

Abstract

Social robots are becoming an important part of our society and should be recognised as viable interaction partners, which includes being perceived as (i) animate beings and (ii) capable of establishing natural interactions with the user. One way of achieving both objectives is to let the robot perform gestures autonomously, which becomes challenging when those gestures have to accompany verbal messages. If the robot uses predefined gestures, the problem to solve is selecting the most appropriate expression for the robot’s speech. In this work, we propose three transformer-based models, collectively called GERT (Gesture-Enhanced Robotics Transformer), that predict the co-speech gestures that best match the robot’s utterances. We compare the performance of the three models, which differ in size, to assess their usability for the gesture prediction task and the trade-off between size and performance. The results show that all three models achieve satisfactory performance (F-score between 0.78 and 0.86).
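As a rough illustration of how such a predictor could be queried at run time, the sketch below tags a robot utterance with gesture labels using a Hugging Face token-classification pipeline. The model identifier is hypothetical (the authors’ released checkpoints are hosted under https://huggingface.co/qfrodicio), and treating gesture prediction as word-level tagging is an assumption made for the example, not a description of the exact GERT setup.

```python
# Minimal sketch: predicting co-speech gesture tags for a robot utterance with a
# fine-tuned transformer, in the spirit of GERT. The model id below is hypothetical;
# see https://huggingface.co/qfrodicio for the authors' actual models.
from transformers import pipeline

MODEL_ID = "qfrodicio/gesture-prediction-model"  # hypothetical name, replace with a real checkpoint

# Token-classification pipeline: each word span in the utterance receives a gesture label,
# which the robot could then map to one of its predefined expressions.
gesture_tagger = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",
)

utterance = "Hello! I am very happy to see you again today."
for span in gesture_tagger(utterance):
    # 'entity_group' holds the predicted gesture class, 'word' the text it accompanies
    print(f"{span['word']!r} -> {span['entity_group']} (score={span['score']:.2f})")
```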

J. Sevilla-Salcedo and E. Fernández-Rodicio—The first two authors contributed equally to this work.


Notes

  1. In this work, we use the terms expression and gesture interchangeably for any coherent combination of multimodal information aimed at achieving a particular communicative goal.

  2. https://huggingface.co/qfrodicio.

  3. https://youtu.be/lvQGwfu8J50.


Acknowledgment

The research leading to these results has received funding from grant PID2021-123941OA-I00, funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”; grant TED2021-132079B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR; and the project “Mejora del nivel de madurez tecnológica del robot Mini” (MeNiR), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work has also been supported by the Madrid Government (Comunidad de Madrid, Spain) under the Multiannual Agreement with UC3M (“Fostering Young Doctors Research”, SMM4HRI-CM-UC3M), in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

Author information


Corresponding author

Correspondence to Javier Sevilla-Salcedo.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sevilla-Salcedo, J., Fernández-Rodicio, E., Castillo, J.C., Castro-González, Á., Salichs, M.A. (2024). GERT: Transformers for Co-speech Gesture Prediction in Social Robots. In: Ali, A.A., et al. (eds.) Social Robotics. ICSR 2023. Lecture Notes in Computer Science, vol. 14453. Springer, Singapore. https://doi.org/10.1007/978-981-99-8715-3_8


  • DOI: https://doi.org/10.1007/978-981-99-8715-3_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8714-6

  • Online ISBN: 978-981-99-8715-3

  • eBook Packages: Computer Science, Computer Science (R0)
