DOI: 10.1145/3678957.3685742

Online Multimodal End-of-Turn Prediction for Three-party Conversations

Published: 04 November 2024

Abstract

Predicting end-of-turn moments in multiparty conversations is crucial for the usability and natural flow of spoken dialogue systems, and it offers substantial benefits to conversational agents. We present a novel window-based method that predicts end-of-turn moments in multiparty conversations in real time, leveraging the capabilities of pre-trained language models (PLMs) and recurrent neural networks (RNNs). Specifically, our method fuses the DistilBERT language model with a Gated Recurrent Unit (GRU) to predict end-of-turn points accurately in an online fashion. Our approach significantly outperforms conventional Inter-Pausal Unit (IPU)-based prediction methods, which often overlook the nuances of overlap and interruption in dynamic conversations. The potential applications of this work are significant, particularly in the domains of virtual agents and human-robot interaction: an accurate online end-of-turn prediction model can be used to make such systems feel more natural and seamlessly integrated into real-world conversations.
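The abstract describes the architecture only at a high level. As an illustration of how a DistilBERT-plus-GRU fusion for window-based, online end-of-turn classification might be wired up, here is a minimal sketch. The class name `EndOfTurnPredictor`, the GRU hidden size, the pooling choice, and the binary head are all assumptions for illustration, not the authors' implementation; the paper's actual model also incorporates multimodal (non-verbal) features not shown here.

```python
# A minimal sketch (not the authors' released code): DistilBERT encodes each
# sliding window of transcribed speech, and a GRU carries state across windows
# so that predictions can be made online as the conversation unfolds.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class EndOfTurnPredictor(nn.Module):  # hypothetical name
    def __init__(self, gru_hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.gru = nn.GRU(self.encoder.config.dim, gru_hidden, batch_first=True)
        self.head = nn.Linear(gru_hidden, 1)  # binary: end-of-turn vs. turn-hold

    def forward(self, input_ids, attention_mask, state=None):
        # Contextual token embeddings for the current window.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # The recurrent state links successive windows of the conversation.
        out, state = self.gru(hidden, state)
        logit = self.head(out[:, -1])  # score at the most recent token
        return torch.sigmoid(logit), state

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = EndOfTurnPredictor().eval()
state = None
# Streaming use: feed each newly transcribed window and keep the GRU state.
for window in ["so I was thinking", "we could wrap this up now"]:
    batch = tokenizer(window, return_tensors="pt")
    with torch.no_grad():
        prob, state = model(batch["input_ids"], batch["attention_mask"], state)
    print(f"{window!r} -> P(end-of-turn) = {prob.item():.2f}")
```

In a deployment, the resulting probability would presumably be thresholded to decide when an agent may take the turn, and non-verbal cues (e.g., gaze or head movement) would be fused alongside the text embeddings.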

      Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. human-human interaction
      2. multi-party conversations
      3. multimodal interaction
      4. non-verbal gesture
      5. turn prediction

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICMI '24
      ICMI '24: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
      November 4 - 8, 2024
      San Jose, Costa Rica

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%

      Article Metrics

• Total citations: 0
• Total downloads: 142
• Downloads (last 12 months): 142
• Downloads (last 6 weeks): 26

Reflects downloads up to 28 Feb 2025.
