
GERT: Transformers for Co-speech Gesture Prediction in Social Robots

  • Conference paper
Social Robotics (ICSR 2023)

Abstract

Social robots are becoming an important part of our society and should be recognised as viable interaction partners, which includes being perceived as (i) animate beings and (ii) capable of establishing natural interactions with the user. One way of achieving both objectives is to let the robot perform gestures autonomously, which becomes challenging when those gestures have to accompany verbal messages. If the robot uses predefined gestures, the problem to solve is selecting the most appropriate expression for the robot’s speech. In this work, we propose three transformer-based models, collectively called GERT (Gesture-Enhanced Robotics Transformer), that predict the co-speech gestures that best match the robot’s utterances. We compare the performance of the three models, which differ in size, to assess their usability for the gesture prediction task and the trade-off between size and performance. The results show that all three models achieve satisfactory performance (F-score between 0.78 and 0.86).
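As a rough illustration of how such a predictor could be queried at run time, the sketch below tags a robot utterance with gesture labels using a Hugging Face token-classification pipeline. The model identifier is hypothetical (the authors’ released checkpoints are hosted under https://huggingface.co/qfrodicio), and treating gesture prediction as word-level tagging is an assumption made for the example, not a description of the exact GERT setup.

```python
# Minimal sketch: predicting co-speech gesture tags for a robot utterance with a
# fine-tuned transformer, in the spirit of GERT. The model id below is hypothetical;
# see https://huggingface.co/qfrodicio for the authors' actual models.
from transformers import pipeline

MODEL_ID = "qfrodicio/gesture-prediction-model"  # hypothetical name, replace with a real checkpoint

# Token-classification pipeline: each word span in the utterance receives a gesture label,
# which the robot could then map to one of its predefined expressions.
gesture_tagger = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",
)

utterance = "Hello! I am very happy to see you again today."
for span in gesture_tagger(utterance):
    # 'entity_group' holds the predicted gesture class, 'word' the text it accompanies
    print(f"{span['word']!r} -> {span['entity_group']} (score={span['score']:.2f})")
```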

J. Sevilla-Salcedo and E. Fernández-Rodicio—The first two authors contributed equally to this work.


Notes

  1. In this work, we use the terms expression and gesture interchangeably for any coherent combination of multimodal information aimed at achieving a particular communicative goal.

  2. https://huggingface.co/qfrodicio.

  3. https://youtu.be/lvQGwfu8J50.


Acknowledgment

The research leading to these results has received funding from grant PID2021-123941OA-I00, funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”; grant TED2021-132079B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR; and the project “Mejora del nivel de madurez tecnológica del robot Mini” (MeNiR), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work has also been supported by the Madrid Government (Comunidad de Madrid, Spain) under the Multiannual Agreement with UC3M (“Fostering Young Doctors Research”, SMM4HRI-CM-UC3M), in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

Author information


Corresponding author

Correspondence to Javier Sevilla-Salcedo.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sevilla-Salcedo, J., Fernández-Rodicio, E., Castillo, J.C., Castro-González, Á., Salichs, M.A. (2024). GERT: Transformers for Co-speech Gesture Prediction in Social Robots. In: Ali, A.A., et al. (eds.) Social Robotics. ICSR 2023. Lecture Notes in Computer Science, vol. 14453. Springer, Singapore. https://doi.org/10.1007/978-981-99-8715-3_8


  • DOI: https://doi.org/10.1007/978-981-99-8715-3_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8714-6

  • Online ISBN: 978-981-99-8715-3

  • eBook Packages: Computer Science, Computer Science (R0)
