Research Article · DOI: 10.1145/3125739.3132594

Speech-to-Gesture Generation: A Challenge in Deep Learning Approach with Bi-Directional LSTM

Published: 27 October 2017

Abstract

In this research, we take a first step toward generating gesture motion data directly from speech features. Such a method could make creating gesture animations for Embodied Conversational Agents much easier. We implemented a model using a Bi-Directional LSTM that takes phonemic features extracted from speech audio as input and outputs time-series rotation data for bone joints. We assessed the validity of the predicted gesture motion data in two ways: by evaluating the final loss value of the network, and by comparing impressions of the predicted gestures against both the original motion data that accompanied the input audio and motion data that accompanied different audio. The results showed that the prediction accuracy of the LSTM model was better than that of a simple RNN model. In contrast, the predicted gestures were rated lower overall than both the original and mismatched gestures, although some individual predicted gestures were rated comparably to the mismatched gestures.
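
The page does not include code, but the model the abstract describes (phonemic speech features in, per-frame joint rotations out, via a Bi-Directional LSTM) can be sketched concisely. Below is a minimal, hypothetical Keras sketch; the layer width and the dimensions T, N_PHONEME, and N_JOINTS are illustrative assumptions, not values taken from the paper, and mean squared error stands in for whatever loss the authors actually used.

```python
# Minimal sketch of a speech-to-gesture model of the kind the abstract
# describes. All sizes below are illustrative assumptions.
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

T = 100          # hypothetical number of speech frames per sample
N_PHONEME = 40   # hypothetical dimensionality of phonemic features
N_JOINTS = 30    # hypothetical size of the joint-rotation vector

model = Sequential([
    Input(shape=(T, N_PHONEME)),
    # The Bi-Directional LSTM reads the phonemic sequence forward and
    # backward, so each output frame can depend on past and future speech.
    Bidirectional(LSTM(256, return_sequences=True)),
    # One joint-rotation vector per time step.
    TimeDistributed(Dense(N_JOINTS)),
])

# Regression against motion-captured joint rotations; the paper evaluates
# the network by its final loss value, so MSE is a natural placeholder.
model.compile(optimizer="adam", loss="mse")

# Toy usage with random tensors, just to show the expected shapes.
x = np.random.randn(8, T, N_PHONEME).astype("float32")  # speech features
y = np.random.randn(8, T, N_JOINTS).astype("float32")   # joint rotations
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x, verbose=0).shape)  # (8, 100, 30)
```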





      Published In

      HAI '17: Proceedings of the 5th International Conference on Human Agent Interaction
      October 2017
      550 pages
      ISBN:9781450351133
      DOI:10.1145/3125739


      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. bi-directional LSTM
      2. deep learning
      3. gesture generation
      4. speech features

      Qualifiers

      • Research-article

      Conference

      HAI '17

      Acceptance Rates

      Overall Acceptance Rate 121 of 404 submissions, 30%


      Cited By

      • (2025) Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation. Human-Friendly Robotics 2024. DOI: 10.1007/978-3-031-81688-8_3, pp. 30-44. Online publication date: 26-Feb-2025.
      • (2024) DR2: Disentangled Recurrent Representation Learning for Data-efficient Speech Video Synthesis. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). DOI: 10.1109/WACV57701.2024.00609, pp. 6192-6202. Online publication date: 3-Jan-2024.
      • (2023) Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent. Companion Publication of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3610661.3616547, pp. 228-237. Online publication date: 9-Oct-2023.
      • (2023) ACT2G. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6(3). DOI: 10.1145/3606940, pp. 1-17. Online publication date: 24-Aug-2023.
      • (2023) Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics 42(4). DOI: 10.1145/3592458, pp. 1-20. Online publication date: 26-Jul-2023.
      • (2023) Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset. Proceedings of the 31st ACM International Conference on Multimedia. DOI: 10.1145/3581783.3611705, pp. 3538-3549. Online publication date: 26-Oct-2023.
      • (2023) Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. Proceedings of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3577190.3616117, pp. 755-762. Online publication date: 9-Oct-2023.
      • (2023) The UEA Digital Humans entry to the GENEA Challenge 2023. Proceedings of the 25th International Conference on Multimodal Interaction. DOI: 10.1145/3577190.3616116, pp. 802-810. Online publication date: 9-Oct-2023.
      • (2023) Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents. 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). DOI: 10.1109/RO-MAN57019.2023.10309493, pp. 405-412. Online publication date: 28-Aug-2023.
      • (2023) FLAG3D: A 3D Fitness Activity Dataset with Language Instruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR52729.2023.02117, pp. 22106-22117. Online publication date: Jun-2023.
