Extended Abstract · DOI: https://doi.org/10.1145/3652988.3673915

A Computational Study on Sentence-based Next Speaker Prediction in Multiparty Conversations

Published: 26 December 2024

Abstract

In this paper, we present a computational study that quantitatively examines the task of predicting the next speaker in multiparty conversations using machine learning models. To this end, we design features that capture information relevant to speaker changes in such conversations. We adopt sentence-based models, rather than the widely used InterPausal Unit (IPU)-based models, and extend the definition of verbal backchanneling to include additional reactions that signal listeners’ attention or interest. Through extensive experiments with various machine learning models and inputs, we show that our sentence-based models outperform existing IPU-based models, with the best model achieving 61.39% accuracy. Our study offers design implications and recommendations for developing virtual agents and humanoid robots with interactive social capabilities.
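
The extended abstract itself does not include implementation details. As a purely illustrative sketch of what a sentence-based next-speaker classifier could look like, the snippet below trains a random forest (one plausible model family for this kind of task) on invented per-sentence features; the feature set, labels, and data here are hypothetical, not the paper's actual pipeline.

# Hypothetical sketch: sentence-based next-speaker prediction.
# All features, labels, and data are invented for illustration;
# the paper's actual features and models may differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sentences, n_speakers = 1000, 3

# Per-sentence features one might extract in a multiparty setting,
# e.g., current-speaker id, sentence duration, mean pitch/energy,
# gaze target at sentence end, and a flag for whether the sentence
# was a (broadly defined) verbal backchannel.
X = rng.random((n_sentences, 8))
# Label: which participant speaks next (may be the current speaker).
y = rng.integers(0, n_speakers, size=n_sentences)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(f"next-speaker accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2%}")

On random features this will hover near chance (about 33% for three speakers); the point is only the framing of each sentence boundary as a classification instance, not the numbers.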



Published In

IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents
September 2024, 337 pages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Human-human Communication
  2. Interactive Social Interaction
  3. Machine Learning
  4. Multimodal Interaction
  5. Multiparty Conversations
  6. Next Speaker Prediction

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited


Conference

IVA '24: ACM International Conference on Intelligent Virtual Agents
September 16–19, 2024
Glasgow, United Kingdom

Acceptance Rates

Overall acceptance rate: 53 of 196 submissions (27%)
