DOI: 10.1145/3577190.3614139
Research Article · Public Access

Multimodal Turn Analysis and Prediction for Multi-party Conversations

Published: 09 October 2023

Abstract

This paper presents a computational study that analyzes and predicts turns (i.e., turn-taking and turn-keeping) in multi-party conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of natural, multimodal conversational behaviors of interlocutors in three-party conversations, including gaze, head movements, body movements, and speech. Based on the inter-pausal units (IPUs) extracted from this in-house dataset, we propose a transformer-based computational model that predicts turns from the interlocutors' states (speaking/back-channeling/silence) and their gaze targets. Our model robustly achieves more than 80% accuracy, and its generalizability was extensively validated through cross-group experiments. We also introduce a novel computational metric, the "relative engagement level" (REL) of IPUs, and show that it differs significantly between turn-keeping and turn-taking IPUs, as well as between different conversational groups. Our experiments further show that the patterns of interlocutor states are a more effective cue than gaze behaviors for predicting turns in multi-party conversations.
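To make the prediction task concrete, below is a minimal sketch, in PyTorch, of the kind of transformer-based turn predictor the abstract describes: each IPU is summarized as a sequence of per-frame categorical features (each interlocutor's state and gaze target), and the model classifies the IPU as turn-keeping or turn-taking. This is not the authors' code; the feature vocabularies, sequence length, and all names below are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the paper)
NUM_STATES = 3         # speaking / back-channeling / silence
NUM_GAZE_TARGETS = 4   # e.g., partner A, partner B, elsewhere, averted
NUM_INTERLOCUTORS = 3  # three-party conversations
SEQ_LEN = 50           # frames sampled around the end of an IPU

class TurnPredictor(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # One embedding table per categorical cue; the per-frame embeddings
        # of all three interlocutors are summed into one d_model-sized token.
        self.state_emb = nn.Embedding(NUM_STATES, d_model)
        self.gaze_emb = nn.Embedding(NUM_GAZE_TARGETS, d_model)
        self.pos_emb = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # turn-keeping vs. turn-taking

    def forward(self, states, gazes):
        # states, gazes: (batch, SEQ_LEN, NUM_INTERLOCUTORS) integer codes
        x = self.state_emb(states).sum(dim=2) + self.gaze_emb(gazes).sum(dim=2)
        pos = torch.arange(states.size(1), device=states.device)
        x = x + self.pos_emb(pos)          # add positional information
        h = self.encoder(x)                # (batch, SEQ_LEN, d_model)
        return self.head(h.mean(dim=1))    # pool over time, then classify

# Usage with random stand-in data:
model = TurnPredictor()
states = torch.randint(0, NUM_STATES, (8, SEQ_LEN, NUM_INTERLOCUTORS))
gazes = torch.randint(0, NUM_GAZE_TARGETS, (8, SEQ_LEN, NUM_INTERLOCUTORS))
logits = model(states, gazes)              # (8, 2) class logits

Keeping the two cue types in separate embeddings makes it straightforward to ablate interlocutor states against gaze targets, which is the kind of comparison behind the abstract's finding that state patterns are the more effective cue.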




      Published In

ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
October 2023, 858 pages
ISBN: 9798400700552
DOI: 10.1145/3577190


      Publisher

Association for Computing Machinery, New York, NY, United States



      Author Tags

      1. Empirical studies
      2. Human-human communication
      3. Machine learning
      4. Multi-party conversations
      5. Multimodal interaction
6. Conversational gesture understanding

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

ICMI '23

      Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)



      Cited By

• (2024) Coordination of Speaking Opportunities in Virtual Reality: Analyzing Interaction Dynamics and Context-Aware Strategies. Applied Sciences 14(24), 12071. DOI: 10.3390/app142412071. Online publication date: 23-Dec-2024.
• (2024) Online Multimodal End-of-Turn Prediction for Three-party Conversations. Proceedings of the 26th International Conference on Multimodal Interaction, 57–65. DOI: 10.1145/3678957.3685742. Online publication date: 4-Nov-2024.
• (2024) A Computational Study on Sentence-based Next Speaker Prediction in Multiparty Conversations. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1–4. DOI: 10.1145/3652988.3673915. Online publication date: 16-Sep-2024.
• (2024) UniMPC: Towards a Unified Framework for Multi-Party Conversations. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2639–2649. DOI: 10.1145/3627673.3679864. Online publication date: 21-Oct-2024.
