Abstract
Spoken dialogue systems need to determine when to respond to a user, in addition to determining the response itself. Various cues, such as prosody, gaze, and facial expression, are known to affect response timing. Recent studies have revealed that using a representation of the system response improves the performance of response timing prediction. However, it is difficult to use a future response directly in dialogue systems that require the entire user utterance before they can generate a response. To alleviate this problem, this study proposes a neural response timing estimation model that uses past utterances. The proposed model is expected to consider the intention of the system response implicitly.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yahagi, R., Chiba, Y., Nose, T., Ito, A. (2022). Multimodal Dialogue Response Timing Estimation Using Dialogue Context Encoder. In: Stoyanchev, S., Ultes, S., Li, H. (eds) Conversational AI for Natural Human-Centric Interaction. Lecture Notes in Electrical Engineering, vol 943. Springer, Singapore. https://doi.org/10.1007/978-981-19-5538-9_9
DOI: https://doi.org/10.1007/978-981-19-5538-9_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5537-2
Online ISBN: 978-981-19-5538-9
eBook Packages: Computer Science, Computer Science (R0)