ABSTRACT
The development of expressive embodied conversational agents (ECAs) remains a major challenge. During an interaction, partners continuously adapt their behaviors to one another [7]. This adaptation may take different forms, such as converging on the same vocabulary and grammatical forms [31], imitation, and synchronization [7]. The aim of my PhD project is to improve human-agent interaction. The key idea is to create an interactive loop between human and agent that allows the virtual agent to continuously adapt its behavior to that of its partner. The approach is to learn how human dyads adapt their behaviors to each other and to transfer this mechanism to human-agent interaction. My work, based on recurrent neural networks, focuses on nonverbal behavior generation and addresses several scientific challenges, such as multimodality, the intra-personal temporality of multimodal signals, and the temporal relations between partners' social cues. We plan to build a model, trained end to end, that generates behaviors from both acoustic and visual modalities.
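To make the intended architecture concrete, here is a minimal sketch assuming PyTorch: a recurrent network consumes the partner's acoustic and visual features frame by frame and predicts the agent's next nonverbal behavior parameters, closing the interactive loop described above. The class name AdaptiveBehaviorRNN, the feature dimensions, and the early-fusion design are illustrative assumptions, not the thesis model.

```python
# Minimal sketch (PyTorch assumed). Names, dimensions, and the early-fusion
# design are illustrative assumptions, not the actual thesis model.
import torch
import torch.nn as nn

class AdaptiveBehaviorRNN(nn.Module):
    def __init__(self, audio_dim=88, visual_dim=17, hidden_dim=128, out_dim=20):
        super().__init__()
        # Early fusion: concatenate the partner's acoustic features
        # (e.g., an openSMILE-style descriptor vector) and visual features
        # (e.g., OpenFace action-unit intensities) at each time step.
        self.rnn = nn.GRU(audio_dim + visual_dim, hidden_dim, batch_first=True)
        # Map the recurrent state to agent behavior parameters,
        # e.g., facial action units and head rotation.
        self.decoder = nn.Linear(hidden_dim, out_dim)

    def forward(self, audio, visual, state=None):
        # audio: (batch, time, audio_dim); visual: (batch, time, visual_dim)
        fused = torch.cat([audio, visual], dim=-1)
        hidden, state = self.rnn(fused, state)
        # Returning the recurrent state allows streaming use: the agent can
        # update its behavior continuously as the partner's signals arrive.
        return self.decoder(hidden), state

# Streaming usage: feed one frame of partner features per step of the loop.
model = AdaptiveBehaviorRNN()
state = None
for _ in range(100):
    audio_frame = torch.randn(1, 1, 88)    # stand-in for real acoustic features
    visual_frame = torch.randn(1, 1, 17)   # stand-in for real visual features
    behavior, state = model(audio_frame, visual_frame, state)
```

Keeping the recurrent state across steps, rather than reprocessing whole sequences, is what makes the continuous human-agent adaptation loop possible at interaction time.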
REFERENCES
- Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, and Yaser Sheikh. 2019. To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations. In 2019 International Conference on Multimodal Interaction. 74–84.
- Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. Computer Graphics Forum 39 (2020), 487–496. https://doi.org/10.1111/cgf.13946
- Michael Argyle. 2013. Bodily communication. Routledge.
- Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–10.
- G. Bebis and M. Georgiopoulos. 1994. Feed-forward neural networks. IEEE Potentials 13, 4 (1994), 27–31. https://doi.org/10.1109/45.329294
- Carola Bloch, Kai Vogeley, Alexandra L Georgescu, and Christine M Falter-Wagner. 2019. INTRApersonal Synchrony as Constituent of INTERpersonal Synchrony and Its Relevance for Autism Spectrum Disorder. Frontiers in Robotics and AI 6 (2019), 73.
- Judee K Burgoon, Laura K Guerrero, and Valerie Manusov. 2011. Nonverbal signals. The SAGE Handbook of Interpersonal Communication (2011), 239–280.
- Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar. 2017. The NoXi database: multimodal recordings of mediated novice-expert interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 350–359. https://doi.org/10.1145/3136755.3136780
- Hang Chu, D. Li, and S. Fidler. 2018. A Face-to-Face Neural Conversation Model. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7113–7121.
- Axel Cleeremans, David Servan-Schreiber, and James McClelland. 1989. Finite State Automata and Simple Recurrent Networks. Neural Computation 1, 3 (1989), 372–381. https://doi.org/10.1162/neco.1989.1.3.372
- Emilie Delaherche, Mohamed Chetouani, Ammar Mahdhaoui, Catherine Saint-Georges, Sylvie Viaux, and David Cohen. 2012. Interpersonal synchrony: A survey of evaluation methods across disciplines. IEEE Transactions on Affective Computing 3, 3 (2012), 349–365.
- Soumia Dermouche and Catherine Pelachaud. 2019. Engagement Modeling in Dyadic Interaction. In 2019 International Conference on Multimodal Interaction. 440–445. https://doi.org/10.1145/3340555.3353765
- Soumia Dermouche and Catherine Pelachaud. 2019. Generative model of agent's behaviors in human-agent interaction. In 2019 International Conference on Multimodal Interaction. 375–384.
- Chuang Ding, Lei Xie, and Pengcheng Zhu. 2014. Head motion synthesis from speech using deep neural networks. Multimedia Tools and Applications 74 (2014). https://doi.org/10.1007/s11042-014-2156-2
- Sidney S D'Mello, Patrick Chipman, and Art Graesser. 2007. Posture as a predictor of learner's affective engagement. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.
- Paul Ekman and Wallace V Friesen. 1976. Measuring facial movement. Environmental Psychology and Nonverbal Behavior 1, 1 (1976), 56–75.
- Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia. 1459–1462.
- Will Feng, Anitha Kannan, Georgia Gkioxari, and C Lawrence Zitnick. 2017. Learn2Smile: Learning non-verbal interaction through observation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4131–4138.
- Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. 2003. A survey of socially interactive robots. Robotics and Autonomous Systems 42, 3-4 (2003), 143–166.
- Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3497–3506.
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
- Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610. https://doi.org/10.1016/j.neunet.2005.06.042
- Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. 79–86.
- Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2020. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–14.
- Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. 2020. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8.
- Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
- Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20, 1 (2010), 70–84.
- Yukiko I Nakano and Ryo Ishii. 2010. Estimating user's engagement from eye-gaze behaviors in human-agent conversations. In Proceedings of the 15th International Conference on Intelligent User Interfaces. 139–148.
- Radoslaw Niewiadomski, Elisabetta Bevacqua, Maurizio Mancini, and Catherine Pelachaud. 2009. Greta: An interactive expressive ECA system. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2. 1399–1400. https://doi.org/10.1145/1558109.1558314
- Ryota Nishimura, Norihide Kitaoka, and Seiji Nakagawa. 2007. A Spoken Dialog System for Chat-Like Conversations Considering Response Timing. In Text, Speech and Dialogue (Lecture Notes in Computer Science, Vol. 4629). Springer, 599–606. https://doi.org/10.1007/978-3-540-74628-7_77
- Martin Pickering and Simon Garrod. 2004. Toward a Mechanistic Psychology of Dialogue. Behavioral and Brain Sciences 27, 2 (2004), 169–190. https://doi.org/10.1017/S0140525X04000056
- Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, and Roland Goecke. 2016. Extending Long Short-Term Memory for Multi-View Structured Learning. In Computer Vision – ECCV 2016 (Lecture Notes in Computer Science, Vol. 9911). Springer, 338–353. https://doi.org/10.1007/978-3-319-46478-7_21
- Brian Ravenet, Magalie Ochs, and Catherine Pelachaud. 2013. From a user-created corpus of virtual agent's non-verbal behavior to a computational model of interpersonal attitudes. In International Workshop on Intelligent Virtual Agents. Springer, 263–274.
- Najmeh Sadoughi and Carlos Busso. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6169–6173.
- Khiet Truong, Ronald Poppe, and Dirk Heylen. 2010. A rule-based backchannel prediction model using pitch and pause information. In Proceedings of INTERSPEECH 2010. 3058–3061.
- Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. 2014. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 1 (2014), 86–98.
- Haimin Yang, Zhisong Pan, and Qing Tao. 2017. Robust and Adaptive Online Time Series Prediction with Long Short-Term Memory. Computational Intelligence and Neuroscience 2017 (2017), 1–9. https://doi.org/10.1155/2017/9478952
- Amir Zadeh, Paul Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory Fusion Network for Multi-view Sequential Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.