Research Article · DOI: 10.1145/3125739.3132594

Speech-to-Gesture Generation: A Challenge in Deep Learning Approach with Bi-Directional LSTM

Published: 27 October 2017

Abstract

In this research, we take a first step toward generating gesture motion data directly from speech features. Such a method could make creating gesture animations for Embodied Conversational Agents much easier. We implemented a model using a Bi-Directional LSTM that takes phonemic features extracted from speech audio as input and outputs time-series rotation data for bone joints. We assessed the validity of the predicted gesture motion data in two ways: by evaluating the final loss value of the network, and by comparing impressions of the predicted gestures against both the original motion data that accompanied the input audio and motion data that accompanied different audio. The results showed that the prediction accuracy of the LSTM model was better than that of a simple RNN model. In contrast, the predicted gestures were rated lower overall than both the original and mismatched gestures, although some individual predicted gestures were rated comparably to the mismatched gestures.
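
The page does not include code, but the model the abstract describes (phonemic speech features in, per-frame joint rotations out, via a Bi-Directional LSTM) can be sketched concisely. Below is a minimal, hypothetical Keras sketch; the layer width and the dimensions T, N_PHONEME, and N_JOINTS are illustrative assumptions, not values taken from the paper, and mean squared error stands in for whatever loss the authors actually used.

```python
# Minimal sketch of a speech-to-gesture model of the kind the abstract
# describes. All sizes below are illustrative assumptions.
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

T = 100          # hypothetical number of speech frames per sample
N_PHONEME = 40   # hypothetical dimensionality of phonemic features
N_JOINTS = 30    # hypothetical size of the joint-rotation vector

model = Sequential([
    Input(shape=(T, N_PHONEME)),
    # The Bi-Directional LSTM reads the phonemic sequence forward and
    # backward, so each output frame can depend on past and future speech.
    Bidirectional(LSTM(256, return_sequences=True)),
    # One joint-rotation vector per time step.
    TimeDistributed(Dense(N_JOINTS)),
])

# Regression against motion-captured joint rotations; the paper evaluates
# the network by its final loss value, so MSE is a natural placeholder.
model.compile(optimizer="adam", loss="mse")

# Toy usage with random tensors, just to show the expected shapes.
x = np.random.randn(8, T, N_PHONEME).astype("float32")  # speech features
y = np.random.randn(8, T, N_JOINTS).astype("float32")   # joint rotations
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x, verbose=0).shape)  # (8, 100, 30)
```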





      Published In

      HAI '17: Proceedings of the 5th International Conference on Human Agent Interaction
      October 2017
      550 pages
      ISBN:9781450351133
      DOI:10.1145/3125739


      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. bi-directional LSTM
      2. deep learning
      3. gesture generation
      4. speech features

      Qualifiers

      • Research-article

      Conference

      HAI '17

      Acceptance Rates

      Overall Acceptance Rate 121 of 404 submissions, 30%


      Cited By

      • (2025) Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation. Human-Friendly Robotics 2024. DOI: 10.1007/978-3-031-81688-8_3, pp. 30-44. Online publication date: 26-Feb-2025.
      • (2024) DR2: Disentangled Recurrent Representation Learning for Data-efficient Speech Video Synthesis. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). DOI: 10.1109/WACV57701.2024.00609, pp. 6192-6202. Online publication date: 3-Jan-2024.
      • (2023) Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent. Companion Publication of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3610661.3616547, pp. 228-237. Online publication date: 9-Oct-2023.
      • (2023) ACT2G. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6(3). DOI: 10.1145/3606940, pp. 1-17. Online publication date: 24-Aug-2023.
      • (2023) Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics 42(4). DOI: 10.1145/3592458, pp. 1-20. Online publication date: 26-Jul-2023.
      • (2023) Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset. Proceedings of the 31st ACM International Conference on Multimedia. DOI: 10.1145/3581783.3611705, pp. 3538-3549. Online publication date: 26-Oct-2023.
      • (2023) Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. Proceedings of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3577190.3616117, pp. 755-762. Online publication date: 9-Oct-2023.
      • (2023) The UEA Digital Humans entry to the GENEA Challenge 2023. Proceedings of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3577190.3616116, pp. 802-810. Online publication date: 9-Oct-2023.
      • (2023) Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents. 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). DOI: 10.1109/RO-MAN57019.2023.10309493, pp. 405-412. Online publication date: 28-Aug-2023.
      • (2023) FLAG3D: A 3D Fitness Activity Dataset with Language Instruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR52729.2023.02117, pp. 22106-22117. Online publication date: Jun-2023.
