DOI: 10.1145/3577190.3614139
Research Article · Public Access

Multimodal Turn Analysis and Prediction for Multi-party Conversations

Published: 09 October 2023

Abstract

This paper presents a computational study that analyzes and predicts turns (i.e., turn-taking and turn-keeping) in multi-party conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of natural, multimodal conversational behaviors of interlocutors in three-party conversations, including gaze, head movements, body movements, and speech. Based on the inter-pausal units (IPUs) extracted from this in-house dataset, we propose a transformer-based computational model that predicts turns from the interlocutors' states (speaking/back-channeling/silence) and their gaze targets. Our model robustly achieves more than 80% accuracy, and its generalizability was extensively validated through cross-group experiments. We also introduce a novel computational metric, the "relative engagement level" (REL) of IPUs, and show that it differs significantly between turn-keeping and turn-taking IPUs, as well as between different conversational groups. Our experiments further show that the patterns of interlocutor states are a more effective cue than gaze behaviors for predicting turns in multi-party conversations.
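To make the prediction task concrete, below is a minimal sketch, in PyTorch, of the kind of transformer-based turn predictor the abstract describes: each IPU is summarized as a sequence of per-frame categorical features (each interlocutor's state and gaze target), and the model classifies the IPU as turn-keeping or turn-taking. This is not the authors' code; the feature vocabularies, sequence length, and all names below are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the paper)
NUM_STATES = 3         # speaking / back-channeling / silence
NUM_GAZE_TARGETS = 4   # e.g., partner A, partner B, elsewhere, averted
NUM_INTERLOCUTORS = 3  # three-party conversations
SEQ_LEN = 50           # frames sampled around the end of an IPU

class TurnPredictor(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # One embedding table per categorical cue; the per-frame embeddings
        # of all three interlocutors are summed into one d_model-sized token.
        self.state_emb = nn.Embedding(NUM_STATES, d_model)
        self.gaze_emb = nn.Embedding(NUM_GAZE_TARGETS, d_model)
        self.pos_emb = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # turn-keeping vs. turn-taking

    def forward(self, states, gazes):
        # states, gazes: (batch, SEQ_LEN, NUM_INTERLOCUTORS) integer codes
        x = self.state_emb(states).sum(dim=2) + self.gaze_emb(gazes).sum(dim=2)
        pos = torch.arange(states.size(1), device=states.device)
        x = x + self.pos_emb(pos)          # add positional information
        h = self.encoder(x)                # (batch, SEQ_LEN, d_model)
        return self.head(h.mean(dim=1))    # pool over time, then classify

# Usage with random stand-in data:
model = TurnPredictor()
states = torch.randint(0, NUM_STATES, (8, SEQ_LEN, NUM_INTERLOCUTORS))
gazes = torch.randint(0, NUM_GAZE_TARGETS, (8, SEQ_LEN, NUM_INTERLOCUTORS))
logits = model(states, gazes)              # (8, 2) class logits

Keeping the two cue types in separate embeddings makes it straightforward to ablate interlocutor states against gaze targets, which is the kind of comparison behind the abstract's finding that state patterns are the more effective cue.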




      Published In

ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
October 2023, 858 pages
ISBN: 9798400700552
DOI: 10.1145/3577190


      Publisher

Association for Computing Machinery, New York, NY, United States



      Author Tags

      1. Empirical studies
      2. Human-human communication
      3. Machine learning
      4. Multi-party conversations
      5. Multimodal interaction
6. Conversational gesture understanding

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

ICMI '23

      Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)



      Cited By

• (2024) Coordination of Speaking Opportunities in Virtual Reality: Analyzing Interaction Dynamics and Context-Aware Strategies. Applied Sciences 14(24), 12071. DOI: 10.3390/app142412071. Online publication date: 23-Dec-2024.
• (2024) Online Multimodal End-of-Turn Prediction for Three-party Conversations. Proceedings of the 26th International Conference on Multimodal Interaction, 57–65. DOI: 10.1145/3678957.3685742. Online publication date: 4-Nov-2024.
• (2024) A Computational Study on Sentence-based Next Speaker Prediction in Multiparty Conversations. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1–4. DOI: 10.1145/3652988.3673915. Online publication date: 16-Sep-2024.
• (2024) UniMPC: Towards a Unified Framework for Multi-Party Conversations. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2639–2649. DOI: 10.1145/3627673.3679864. Online publication date: 21-Oct-2024.
