DOI: 10.1145/3678957.3685742

Online Multimodal End-of-Turn Prediction for Three-party Conversations

Published: 04 November 2024

Abstract

Predicting end-of-turn moments in multiparty conversations is crucial for the usability and natural flow of spoken dialogue systems, and it offers substantial benefits to conversational agents. We present a novel window-based method that predicts end-of-turn moments in multiparty conversations in real time, leveraging the capabilities of pre-trained language models (PLMs) and recurrent neural networks (RNNs). Specifically, our method fuses the DistilBERT language model with a Gated Recurrent Unit (GRU) to predict end-of-turn points accurately in an online fashion. Our approach significantly outperforms conventional Inter-Pausal Unit (IPU)-based prediction methods, which often overlook the nuances of overlap and interruption in dynamic conversations. The potential applications of this work are significant, particularly in the domains of virtual agents and human-robot interaction: an accurate online end-of-turn prediction model can be used to make such systems feel more natural and seamlessly integrated into real-world conversations.
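The abstract describes the architecture only at a high level. As an illustration of how a DistilBERT-plus-GRU fusion for window-based, online end-of-turn classification might be wired up, here is a minimal sketch. The class name `EndOfTurnPredictor`, the GRU hidden size, the pooling choice, and the binary head are all assumptions for illustration, not the authors' implementation; the paper's actual model also incorporates multimodal (non-verbal) features not shown here.

```python
# A minimal sketch (not the authors' released code): DistilBERT encodes each
# sliding window of transcribed speech, and a GRU carries state across windows
# so that predictions can be made online as the conversation unfolds.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class EndOfTurnPredictor(nn.Module):  # hypothetical name
    def __init__(self, gru_hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.gru = nn.GRU(self.encoder.config.dim, gru_hidden, batch_first=True)
        self.head = nn.Linear(gru_hidden, 1)  # binary: end-of-turn vs. turn-hold

    def forward(self, input_ids, attention_mask, state=None):
        # Contextual token embeddings for the current window.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # The recurrent state links successive windows of the conversation.
        out, state = self.gru(hidden, state)
        logit = self.head(out[:, -1])  # score at the most recent token
        return torch.sigmoid(logit), state

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = EndOfTurnPredictor().eval()
state = None
# Streaming use: feed each newly transcribed window and keep the GRU state.
for window in ["so I was thinking", "we could wrap this up now"]:
    batch = tokenizer(window, return_tensors="pt")
    with torch.no_grad():
        prob, state = model(batch["input_ids"], batch["attention_mask"], state)
    print(f"{window!r} -> P(end-of-turn) = {prob.item():.2f}")
```

In a deployment, the resulting probability would presumably be thresholded to decide when an agent may take the turn, and non-verbal cues (e.g., gaze or head movement) would be fused alongside the text embeddings.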

      Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. human-human interaction
      2. multi-party conversations
      3. multimodal interaction
      4. non-verbal gesture
      5. turn prediction

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICMI '24
      ICMI '24: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
      November 4 - 8, 2024
      San Jose, Costa Rica

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%

      Article Metrics

• Total citations: 0
• Total downloads: 142
• Downloads (last 12 months): 142
• Downloads (last 6 weeks): 26

Reflects downloads up to 28 Feb 2025.
