Research Article
DOI: 10.1145/3570945.3607315

Towards Real-time Co-speech Gesture Generation in Online Interaction in Social XR

Published: 22 December 2023

Abstract

Extended Reality (XR) has the potential to enable social interaction between people who are physically distant, in educational, clinical, and co-working applications, as well as in scientific studies. However, fully embodied social presence and interaction via avatars in XR requires motion-tracking hardware that many users do not have. At the same time, modern machine learning approaches can synthesize natural, life-like nonverbal behavior, but so far only in offline settings and with considerable lag. We evaluate the applicability of current gesture generation systems to online interaction in social XR. We define a set of requirements for real-time-capable gesture generation and propose an approach for employing a state-of-the-art model in a real-time XR interaction pipeline. To test the model under online interaction conditions, we divide an input audio stream into chunks of different lengths and stitch the resulting gesture animations together into continuous motion. We evaluate the quality of the resulting multimodal avatar behavior in a user study. Our results show a significant trade-off between real-time generation capability and gesture quality, and we suggest future improvements for retaining model performance during online interaction in social XR. A project page with videos of the generated gestures is available at https://nkrome.github.io/CAGE.html.
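The chunking-and-stitching step can be illustrated with a short sketch. This is a minimal Python illustration under assumptions, not the authors' implementation: generate_gestures stands in for an arbitrary speech-driven gesture model, and the cross-fade width is a hypothetical parameter.

import numpy as np

BLEND_FRAMES = 5  # hypothetical: pose frames cross-faded at each chunk boundary

def generate_gestures(audio_chunk: np.ndarray) -> np.ndarray:
    """Placeholder for a speech-to-gesture model; returns an array of
    shape (frames, pose_dims) for one chunk of audio samples."""
    raise NotImplementedError

def stitch(prev: np.ndarray, nxt: np.ndarray, blend: int = BLEND_FRAMES) -> np.ndarray:
    """Linearly cross-fade the last `blend` frames of `prev` into the
    first `blend` frames of `nxt` to avoid jumps at the chunk seam.
    Assumes both segments have at least `blend` frames."""
    w = np.linspace(0.0, 1.0, blend)[:, None]           # fade-in weights, shape (blend, 1)
    seam = (1.0 - w) * prev[-blend:] + w * nxt[:blend]  # blended boundary frames
    return np.concatenate([prev[:-blend], seam, nxt[blend:]])

def run(audio_chunks) -> np.ndarray:
    """Consume successive audio buffers (e.g. from a live microphone
    stream) and build one continuous pose sequence."""
    motion = None
    for chunk in audio_chunks:
        poses = generate_gestures(chunk)
        motion = poses if motion is None else stitch(motion, poses)
    return motion

A real-time variant would emit each stitched segment to the avatar as soon as it is ready rather than accumulating the full sequence; accumulation here only keeps the sketch compact.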


Cited By

  • (2024) Minimal Latency Speech-Driven Gesture Generation for Continuous Interaction in Social XR. In 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 236-240. https://doi.org/10.1109/AIxVR59861.2024.00038. Online publication date: 17 January 2024.


      Published In

      IVA '23: Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents
      September 2023, 376 pages
      ISBN: 9781450399944
      DOI: 10.1145/3570945

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 22 December 2023
      Received: 24 April 2023


      Author Tags

      1. animation
      2. extended reality
      3. gesture generation
      4. social interaction

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      IVA '23

      Acceptance Rates

      Overall Acceptance Rate: 53 of 196 submissions, 27%

