Extended Abstract · DOI: https://doi.org/10.1145/3652988.3673915

A Computational Study on Sentence-based Next Speaker Prediction in Multiparty Conversations

Published: 26 December 2024

Abstract

In this paper, we present a computational study that quantitatively examines the task of predicting the next speaker in multiparty conversations using machine learning models. To this end, we design features that capture information relevant to speaker changes in such conversations. We adopt sentence-based models, rather than the widely used InterPausal Unit (IPU)-based models, and extend the definition of verbal backchanneling to include additional reactions that signal listeners’ attention or interest. Through extensive experiments with various machine learning models and inputs, we show that our sentence-based models outperform existing IPU-based models, with the best model achieving 61.39% accuracy. Our study offers design implications and recommendations for developing virtual agents and humanoid robots with interactive social capabilities.
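
The extended abstract itself does not include implementation details. As a purely illustrative sketch of what a sentence-based next-speaker classifier could look like, the snippet below trains a random forest (one plausible model family for this kind of task) on invented per-sentence features; the feature set, labels, and data here are hypothetical, not the paper's actual pipeline.

# Hypothetical sketch: sentence-based next-speaker prediction.
# All features, labels, and data are invented for illustration;
# the paper's actual features and models may differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sentences, n_speakers = 1000, 3

# Per-sentence features one might extract in a multiparty setting,
# e.g., current-speaker id, sentence duration, mean pitch/energy,
# gaze target at sentence end, and a flag for whether the sentence
# was a (broadly defined) verbal backchannel.
X = rng.random((n_sentences, 8))
# Label: which participant speaks next (may be the current speaker).
y = rng.integers(0, n_speakers, size=n_sentences)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(f"next-speaker accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2%}")

On random features this will hover near chance (about 33% for three speakers); the point is only the framing of each sentence boundary as a classification instance, not the numbers.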



Published In

IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents
September 2024, 337 pages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Human-human Communication
  2. Interactive Social Interaction
  3. Machine Learning
  4. Multimodal Interaction
  5. Multiparty Conversations
  6. Next Speaker Prediction

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited


Conference

IVA '24: ACM International Conference on Intelligent Virtual Agents
September 16–19, 2024
Glasgow, United Kingdom

Acceptance Rates

Overall acceptance rate: 53 of 196 submissions (27%)
