Multimodal Dialogue Response Timing Estimation Using Dialogue Context Encoder

Yahagi, Ryota; Chiba, Yuya; Nose, Takashi; Ito, Akinori

doi:10.1007/978-981-19-5538-9_9

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 943)

Abstract

Spoken dialogue systems must decide not only what to say but also when to respond to a user. Various cues, such as prosody, gaze, and facial expression, are known to affect response timing. Recent studies have shown that using a representation of the system response improves response timing prediction. However, a future response is difficult to use directly in dialogue systems that need the entire user utterance before they can generate a response. To alleviate this problem, this study proposes a neural response timing estimation model that uses past utterances. The proposed model is expected to implicitly take the intention of the system response into account.

Author information

Authors and Affiliations

Graduate School of Engineering, Tohoku University, Sendai, Japan
Ryota Yahagi, Takashi Nose & Akinori Ito
NTT Communication Science Laboratories, NTT Corporation, Chiyoda City, Japan
Yuya Chiba

Corresponding author

Correspondence to Yuya Chiba.

Editor information

Editors and Affiliations

Toshiba (United Kingdom), Weybridge, UK
Svetlana Stoyanchev
Daimler (Germany), Stuttgart, Germany
Stefan Ultes
The Chinese University of Hong Kong, Shenzhen, China
Haizhou Li

About this paper

Cite this paper

Yahagi, R., Chiba, Y., Nose, T., Ito, A. (2022). Multimodal Dialogue Response Timing Estimation Using Dialogue Context Encoder. In: Stoyanchev, S., Ultes, S., Li, H. (eds) Conversational AI for Natural Human-Centric Interaction. Lecture Notes in Electrical Engineering, vol 943. Springer, Singapore. https://doi.org/10.1007/978-981-19-5538-9_9

DOI: https://doi.org/10.1007/978-981-19-5538-9_9
Published: 01 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5537-2
Online ISBN: 978-981-19-5538-9
eBook Packages: Computer Science, Computer Science (R0)
