ABSTRACT
The development of expressive embodied conversational agents (ECAs) remains a major challenge. During an interaction, partners continuously adapt their behaviors to one another [7]. This adaptation may take different forms, such as converging on the same vocabulary and grammatical forms [31], imitation, and synchronization [7]. The aim of my PhD project is to improve human-agent interaction. The key idea is to create an interactive loop between human and agent that allows the virtual agent to continuously adapt its behavior to that of its partner. The approach is to learn how human dyads adapt their behaviors to each other and to transfer this mechanism to human-agent interaction. My work, based on recurrent neural networks, focuses on nonverbal behavior generation and addresses several scientific challenges, such as multimodality, the intra-personal temporality of multimodal signals, and the temporal relations between partners' social cues. We plan to build a model, trained end to end, that generates behaviors from both acoustic and visual modalities.
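To make the intended architecture concrete, here is a minimal sketch assuming PyTorch: a recurrent network consumes the partner's acoustic and visual features frame by frame and predicts the agent's next nonverbal behavior parameters, closing the interactive loop described above. The class name AdaptiveBehaviorRNN, the feature dimensions, and the early-fusion design are illustrative assumptions, not the thesis model.

```python
# Minimal sketch (PyTorch assumed). Names, dimensions, and the early-fusion
# design are illustrative assumptions, not the actual thesis model.
import torch
import torch.nn as nn

class AdaptiveBehaviorRNN(nn.Module):
    def __init__(self, audio_dim=88, visual_dim=17, hidden_dim=128, out_dim=20):
        super().__init__()
        # Early fusion: concatenate the partner's acoustic features
        # (e.g., an openSMILE-style descriptor vector) and visual features
        # (e.g., OpenFace action-unit intensities) at each time step.
        self.rnn = nn.GRU(audio_dim + visual_dim, hidden_dim, batch_first=True)
        # Map the recurrent state to agent behavior parameters,
        # e.g., facial action units and head rotation.
        self.decoder = nn.Linear(hidden_dim, out_dim)

    def forward(self, audio, visual, state=None):
        # audio: (batch, time, audio_dim); visual: (batch, time, visual_dim)
        fused = torch.cat([audio, visual], dim=-1)
        hidden, state = self.rnn(fused, state)
        # Returning the recurrent state allows streaming use: the agent can
        # update its behavior continuously as the partner's signals arrive.
        return self.decoder(hidden), state

# Streaming usage: feed one frame of partner features per step of the loop.
model = AdaptiveBehaviorRNN()
state = None
for _ in range(100):
    audio_frame = torch.randn(1, 1, 88)    # stand-in for real acoustic features
    visual_frame = torch.randn(1, 1, 17)   # stand-in for real visual features
    behavior, state = model(audio_frame, visual_frame, state)
```

Keeping the recurrent state across steps, rather than reprocessing whole sequences, is what makes the continuous human-agent adaptation loop possible at interaction time.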
REFERENCES
- Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, and Yaser Sheikh. 2019. To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations. In 2019 International Conference on Multimodal Interaction. 74–84.
- Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. Computer Graphics Forum 39 (2020), 487–496. https://doi.org/10.1111/cgf.13946
- Michael Argyle. 2013. Bodily communication. Routledge.
- Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–10.
- G. Bebis and M. Georgiopoulos. 1994. Feed-forward neural networks. IEEE Potentials 13, 4 (1994), 27–31. https://doi.org/10.1109/45.329294
- Carola Bloch, Kai Vogeley, Alexandra L Georgescu, and Christine M Falter-Wagner. 2019. INTRApersonal Synchrony as Constituent of INTERpersonal Synchrony and Its Relevance for Autism Spectrum Disorder. Frontiers in Robotics and AI 6 (2019), 73.
- Judee K Burgoon, Laura K Guerrero, and Valerie Manusov. 2011. Nonverbal signals. The SAGE Handbook of Interpersonal Communication (2011), 239–280.
- Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar. 2017. The NoXi database: multimodal recordings of mediated novice-expert interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 350–359. https://doi.org/10.1145/3136755.3136780
- Hang Chu, D. Li, and S. Fidler. 2018. A Face-to-Face Neural Conversation Model. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7113–7121.
- Axel Cleeremans, David Servan-Schreiber, and James McClelland. 1989. Finite State Automata and Simple Recurrent Networks. Neural Computation 1, 3 (1989), 372–381. https://doi.org/10.1162/neco.1989.1.3.372
- Emilie Delaherche, Mohamed Chetouani, Ammar Mahdhaoui, Catherine Saint-Georges, Sylvie Viaux, and David Cohen. 2012. Interpersonal synchrony: A survey of evaluation methods across disciplines. IEEE Transactions on Affective Computing 3, 3 (2012), 349–365.
- Soumia Dermouche and Catherine Pelachaud. 2019. Engagement Modeling in Dyadic Interaction. In 2019 International Conference on Multimodal Interaction. 440–445. https://doi.org/10.1145/3340555.3353765
- Soumia Dermouche and Catherine Pelachaud. 2019. Generative model of agent's behaviors in human-agent interaction. In 2019 International Conference on Multimodal Interaction. 375–384.
- Chuang Ding, Lei Xie, and Pengcheng Zhu. 2014. Head motion synthesis from speech using deep neural networks. Multimedia Tools and Applications 74 (2014). https://doi.org/10.1007/s11042-014-2156-2
- Sidney S D'Mello, Patrick Chipman, and Art Graesser. 2007. Posture as a predictor of learner's affective engagement. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.
- Paul Ekman and Wallace V Friesen. 1976. Measuring facial movement. Environmental Psychology and Nonverbal Behavior 1, 1 (1976), 56–75.
- Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia. 1459–1462.
- Will Feng, Anitha Kannan, Georgia Gkioxari, and C Lawrence Zitnick. 2017. Learn2Smile: Learning non-verbal interaction through observation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4131–4138.
- Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. 2003. A survey of socially interactive robots. Robotics and Autonomous Systems 42, 3-4 (2003), 143–166.
- Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3497–3506.
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
- Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610. https://doi.org/10.1016/j.neunet.2005.06.042
- Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. 79–86.
- Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2020. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–14.
- Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. 2020. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8.
- Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
- Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20, 1 (2010), 70–84.
- Yukiko I Nakano and Ryo Ishii. 2010. Estimating user's engagement from eye-gaze behaviors in human-agent conversations. In Proceedings of the 15th International Conference on Intelligent User Interfaces. 139–148.
- Radoslaw Niewiadomski, Elisabetta Bevacqua, Maurizio Mancini, and Catherine Pelachaud. 2009. Greta: An interactive expressive ECA system. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2. 1399–1400. https://doi.org/10.1145/1558109.1558314
- Ryota Nishimura, Norihide Kitaoka, and Seiji Nakagawa. 2007. A Spoken Dialog System for Chat-Like Conversations Considering Response Timing. In Text, Speech and Dialogue (Lecture Notes in Computer Science, Vol. 4629). Springer, 599–606. https://doi.org/10.1007/978-3-540-74628-7_77
- Martin Pickering and Simon Garrod. 2004. Toward a Mechanistic Psychology of Dialogue. Behavioral and Brain Sciences 27, 2 (2004), 169–190. https://doi.org/10.1017/S0140525X04000056
- Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending Long Short-Term Memory for Multi-View Structured Learning. In Computer Vision – ECCV 2016 (Lecture Notes in Computer Science, Vol. 9911). Springer, 338–353. https://doi.org/10.1007/978-3-319-46478-7_21
- Brian Ravenet, Magalie Ochs, and Catherine Pelachaud. 2013. From a user-created corpus of virtual agent's non-verbal behavior to a computational model of interpersonal attitudes. In International Workshop on Intelligent Virtual Agents. Springer, 263–274.
- Najmeh Sadoughi and Carlos Busso. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6169–6173.
- Khiet Truong, Ronald Poppe, and Dirk Heylen. 2010. A rule-based backchannel prediction model using pitch and pause information. In Proceedings of INTERSPEECH 2010. 3058–3061.
- Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. 2014. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 1 (2014), 86–98.
- Haimin Yang, Zhisong Pan, and Qing Tao. 2017. Robust and Adaptive Online Time Series Prediction with Long Short-Term Memory. Computational Intelligence and Neuroscience 2017 (2017), 1–9. https://doi.org/10.1155/2017/9478952
- Amir Zadeh, Paul Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory Fusion Network for Multi-view Sequential Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.