Abstract
Social robots are becoming an important part of our society and should be recognised as viable interaction partners, which includes being perceived as i) animate beings and ii) capable of establishing natural interactions with the user. One way to achieve both objectives is to allow the robot to perform gestures autonomously, which can become problematic when those gestures have to accompany verbal messages. If the robot uses predefined gestures, the issue to solve is selecting the most appropriate expression for the robot's speech. In this work, we propose three transformer-based models called GERT (Gesture-Enhanced Robotics Transformer) that predict the co-speech gestures that best match the robot's utterances. We have compared the performance of the three models, which differ in size, to assess their usability for the gesture prediction task and the trade-off between size and performance. The results show that all three models achieve satisfactory performance (F-scores between 0.78 and 0.86).
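The task the abstract describes is, in essence, text classification: given a robot utterance, predict the most suitable gesture label from a predefined inventory. The sketch below illustrates that framing with a toy bag-of-words scorer standing in for the transformer encoder, so it stays self-contained; the gesture labels and training utterances are illustrative, not taken from the paper's dataset, and a GERT-style model would replace the toy featurizer with a fine-tuned transformer.

```python
# Toy sketch: co-speech gesture prediction as utterance classification.
# A bag-of-words scorer stands in for the transformer encoder; labels
# and training data are hypothetical examples, not the paper's corpus.
from collections import Counter, defaultdict


class ToyGesturePredictor:
    def __init__(self):
        # per-gesture word counts accumulated from training utterances
        self.word_counts = defaultdict(Counter)

    def fit(self, utterances, gestures):
        for text, gesture in zip(utterances, gestures):
            self.word_counts[gesture].update(text.lower().split())
        return self

    def predict(self, utterance):
        words = utterance.lower().split()
        # score each gesture label by word overlap with its training data
        scores = {g: sum(counts[w] for w in words)
                  for g, counts in self.word_counts.items()}
        return max(scores, key=scores.get)


train_utterances = ["hello nice to meet you",
                    "goodbye see you soon",
                    "i do not know"]
train_gestures = ["wave_greeting", "wave_farewell", "shrug"]

model = ToyGesturePredictor().fit(train_utterances, train_gestures)
print(model.predict("hello there"))  # prints "wave_greeting"
```

The same fit/predict interface carries over when the scorer is swapped for a transformer with a classification head; only the featurization changes, not the task formulation.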
J. Sevilla-Salcedo and E. Fernández-Rodicio contributed equally to this work.
Notes
- 1.
In this work, we use the terms "expression" and "gesture" interchangeably for any coherent combination of multimodal information aimed at achieving a particular communicative goal.
Acknowledgment
The research leading to these results has received funding from the following grants: PID2021-123941OA-I00, funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe"; TED2021-132079B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR; and Mejora del nivel de madurez tecnológica del robot Mini (MeNiR), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work has also been supported by the Madrid Government (Comunidad de Madrid, Spain) under the Multiannual Agreement with UC3M ("Fostering Young Doctors Research", SMM4HRI-CM-UC3M) and in the context of the V PRICIT (Research and Technological Innovation Regional Programme).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sevilla-Salcedo, J., Fernández-Rodicio, E., Castillo, J.C., Castro-González, Á., Salichs, M.A. (2024). GERT: Transformers for Co-speech Gesture Prediction in Social Robots. In: Ali, A.A., et al. Social Robotics. ICSR 2023. Lecture Notes in Computer Science, vol. 14453. Springer, Singapore. https://doi.org/10.1007/978-981-99-8715-3_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8714-6
Online ISBN: 978-981-99-8715-3