DOI: 10.1145/3462244.3481275

Development of an Interactive Human/Agent Loop using Multimodal Recurrent Neural Networks

Published: 18 October 2021

ABSTRACT

The development of expressive embodied conversational agents (ECAs) remains a major challenge. During an interaction, partners continuously adapt their behaviors to one another [7]. Adaptation mechanisms can take different forms, such as choosing the same vocabulary and grammatical forms [31], or imitation and synchronization [7]. The aim of my PhD project is to improve human-agent interaction. The key idea is to create an interactive loop between human and agent that allows the virtual agent to continuously adapt its behavior to its partner's behavior. To this end, we learn how human dyads adapt their behaviors to each other and transfer this mechanism to human-agent interaction. My work, based on recurrent neural networks, focuses on nonverbal behavior generation and addresses several scientific challenges, such as multimodality, the intra-personal temporality of multimodal signals, and the temporal relations between partners' social cues. We plan to build a model, trained in an end-to-end fashion, that generates behaviors considering both acoustic and visual modalities.
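To make the intended architecture concrete, below is a minimal sketch (a speculative illustration, not the authors' implementation) of a multimodal recurrent model that consumes the interlocutor's acoustic and visual features frame by frame and outputs the agent's nonverbal behavior, so it can run inside an interactive human/agent loop. All layer sizes, feature dimensions, and feature sources (e.g., openSMILE [17] for acoustic descriptors, OpenFace [4] for facial action units and head pose) are assumptions made for illustration.

```python
# Hypothetical sketch of a multimodal recurrent behavior generator.
# Dimensions and architecture choices are illustrative assumptions only.
import torch
import torch.nn as nn


class MultimodalBehaviorRNN(nn.Module):
    def __init__(self, acoustic_dim=88, visual_dim=35, hidden_dim=128, behavior_dim=35):
        super().__init__()
        # Per-modality encoders model the intra-personal temporality of
        # each signal before the two streams are fused.
        self.acoustic_rnn = nn.GRU(acoustic_dim, hidden_dim, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # The fusion RNN models the temporal relation between the
        # interlocutor's fused cues and the agent's generated behavior.
        self.fusion_rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, behavior_dim)

    def forward(self, acoustic, visual, states=None):
        # acoustic: (batch, time, acoustic_dim); visual: (batch, time, visual_dim)
        a_state, v_state, f_state = states if states is not None else (None, None, None)
        a, a_state = self.acoustic_rnn(acoustic, a_state)
        v, v_state = self.visual_rnn(visual, v_state)
        fused, f_state = self.fusion_rnn(torch.cat([a, v], dim=-1), f_state)
        # One frame of agent nonverbal behavior (e.g., action units and
        # head rotations) is produced for each input frame.
        behavior = self.decoder(fused)
        return behavior, (a_state, v_state, f_state)


# Interactive loop: each new frame of the human's audio/visual cues is fed
# to the model, and the recurrent states carry the interaction history.
model = MultimodalBehaviorRNN()
states = None
for _ in range(3):  # stand-in for a live capture loop, one frame per step
    acoustic_frame = torch.randn(1, 1, 88)  # placeholder for real acoustic features
    visual_frame = torch.randn(1, 1, 35)    # placeholder for real visual features
    behavior, states = model(acoustic_frame, visual_frame, states)
    # `behavior` would then drive the agent's animation (e.g., the Greta platform [29]).
```

In this sketch the loop is closed outside the model: the generated behavior is rendered on the agent, the human reacts, and the next captured frame feeds back into the same recurrent states, which is one simple way to realize the continuous adaptation described above.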

References

1. Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, and Yaser Sheikh. 2019. To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations. In 2019 International Conference on Multimodal Interaction. 74–84.
2. Simon Alexanderson, Gustav Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. Computer Graphics Forum 39 (2020), 487–496. https://doi.org/10.1111/cgf.13946
3. Michael Argyle. 2013. Bodily communication. Routledge.
4. Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–10.
5. G. Bebis and M. Georgiopoulos. 1994. Feed-forward neural networks. IEEE Potentials 13, 4 (1994), 27–31. https://doi.org/10.1109/45.329294
6. Carola Bloch, Kai Vogeley, Alexandra L Georgescu, and Christine M Falter-Wagner. 2019. INTRApersonal Synchrony as Constituent of INTERpersonal Synchrony and Its Relevance for Autism Spectrum Disorder. Frontiers in Robotics and AI 6 (2019), 73.
7. Judee K Burgoon, Laura K Guerrero, and Valerie Manusov. 2011. Nonverbal signals. The SAGE handbook of interpersonal communication (2011), 239–280.
8. Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth Andre, and Michel Valstar. 2017. The NoXi database: multimodal recordings of mediated novice-expert interactions. 350–359. https://doi.org/10.1145/3136755.3136780
9. Hang Chu, D. Li, and S. Fidler. 2018. A Face-to-Face Neural Conversation Model. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7113–7121.
10. Axel Cleeremans, David Servan-Schreiber, and James McClelland. 1989. Finite State Automata and Simple Recurrent Networks. Neural Computation 1, 3 (1989), 372–381. https://doi.org/10.1162/neco.1989.1.3.372
11. Emilie Delaherche, Mohamed Chetouani, Ammar Mahdhaoui, Catherine Saint-Georges, Sylvie Viaux, and David Cohen. 2012. Interpersonal synchrony: A survey of evaluation methods across disciplines. IEEE Transactions on Affective Computing 3, 3 (2012), 349–365.
12. Soumia Dermouche and Catherine Pelachaud. 2019. Engagement Modeling in Dyadic Interaction. 440–445. https://doi.org/10.1145/3340555.3353765
13. Soumia Dermouche and Catherine Pelachaud. 2019. Generative model of agent's behaviors in human-agent interaction. In 2019 International Conference on Multimodal Interaction. 375–384.
14. Chuang Ding, Lei Xie, and Pengcheng Zhu. 2014. Head motion synthesis from speech using deep neural networks. Multimedia Tools and Applications 74 (2014). https://doi.org/10.1007/s11042-014-2156-2
15. Sidney S D'Mello, Patrick Chipman, and Art Graesser. 2007. Posture as a predictor of learner's affective engagement. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.
16. Paul Ekman and Wallace V Friesen. 1976. Measuring facial movement. Environmental Psychology and Nonverbal Behavior 1, 1 (1976), 56–75.
17. Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia. 1459–1462.
18. Will Feng, Anitha Kannan, Georgia Gkioxari, and C Lawrence Zitnick. 2017. Learn2Smile: Learning non-verbal interaction through observation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4131–4138.
19. Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. 2003. A survey of socially interactive robots. Robotics and Autonomous Systems 42, 3-4 (2003), 143–166.
20. Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3497–3506.
21. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
22. Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610. https://doi.org/10.1016/j.neunet.2005.06.042 IJCNN 2005.
23. Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. 79–86.
24. Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2020. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–14.
25. Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. 2020. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8.
26. Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
27. Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20, 1 (2010), 70–84.
28. Yukiko I Nakano and Ryo Ishii. 2010. Estimating user's engagement from eye-gaze behaviors in human-agent conversations. In Proceedings of the 15th International Conference on Intelligent User Interfaces. 139–148.
29. Radoslaw Niewiadomski, Elisabetta Bevacqua, Maurizio Mancini, and Catherine Pelachaud. 2009. Greta: An interactive expressive ECA system. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2. 1399–1400. https://doi.org/10.1145/1558109.1558314
30. Ryota Nishimura, Norihide Kitaoka, and Seiji Nakagawa. 2007. A Spoken Dialog System for Chat-Like Conversations Considering Response Timing. Vol. 4629. 599–606. https://doi.org/10.1007/978-3-540-74628-7_77
31. Martin Pickering and Simon Garrod. 2004. Toward a Mechanistic Psychology of Dialogue. The Behavioral and Brain Sciences 27 (2004), 169–190. https://doi.org/10.1017/S0140525X04000056
32. Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending Long Short-Term Memory for Multi-View Structured Learning. Vol. 9911. 338–353. https://doi.org/10.1007/978-3-319-46478-7_21
33. Brian Ravenet, Magalie Ochs, and Catherine Pelachaud. 2013. From a user-created corpus of virtual agent's non-verbal behavior to a computational model of interpersonal attitudes. In International Workshop on Intelligent Virtual Agents. Springer, 263–274.
34. Najmeh Sadoughi and Carlos Busso. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6169–6173.
35. Khiet Truong, Ronald Poppe, and Dirk Heylen. 2010. A rule-based backchannel prediction model using pitch and pause information. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010). 3058–3061.
36. Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. 2014. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 1 (2014), 86–98.
37. Haimin Yang, Zhisong Pan, and Qing Tao. 2017. Robust and Adaptive Online Time Series Prediction with Long Short-Term Memory. Computational Intelligence and Neuroscience 2017 (2017), 1–9. https://doi.org/10.1155/2017/9478952
38. Amir Zadeh, Paul Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory Fusion Network for Multi-view Sequential Learning. (2018).

Published in

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021, 876 pages
ISBN: 9781450384810
DOI: 10.1145/3462244

Copyright © 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 18 October 2021


Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
