Abstract
To make conversational agents or robots behave in a human-like manner, it is important to model the system’s internal states. In this paper, we address a model of the system’s favorable impression of its dialogue partner. The favorable impression is modeled to change according to the user’s dialogue behaviors and, in turn, to affect the system’s subsequent dialogue behaviors, specifically the selection of utterance constructional units. For this modeling, we propose a hierarchical structure of logistic regression models. First, from the user’s dialogue behaviors, the model estimates the level of the user’s favorable impression of the system and the level of the user’s interest in the current topic. Then, based on these estimates, the model predicts the system’s favorable impression of the user. Finally, the model determines the selection of utterance constructional units in the next system turn. We train each of the logistic regression models individually with a small amount of data annotated with favorable impression. Afterward, the entire multi-layer network is fine-tuned with a larger amount of dialogue behavior data. Experimental results show that the proposed method achieves higher accuracy in selecting utterance constructional units than methods that do not take the system’s internal states into account.
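The two-stage training described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature count (6), the number of utterance constructional units (3), the synthetic data generator `make_data`, the learning rates, and all class and function names are assumptions chosen for illustration only. Each layer is a logistic regression model; the layers are first fitted individually on a small set with internal-state labels, then the stacked network is fine-tuned end to end on a larger set that carries only behavior inputs and unit-selection targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LogisticLayer:
    """One logistic regression layer: y = sigmoid(x @ W + b)."""
    def __init__(self, n_in, n_out):
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x
        self.y = sigmoid(x @ self.W + self.b)
        return self.y

    def backward(self, grad_z, lr):
        """Take one gradient step given dLoss/dz and return dLoss/dx."""
        grad_x = grad_z @ self.W.T  # computed before the weights change
        self.W -= lr * (self.x.T @ grad_z) / len(self.x)
        self.b -= lr * grad_z.mean(axis=0)
        return grad_x

def pretrain(layer, x, target, lr=0.5, epochs=200):
    """Stage 1: fit a single layer alone with cross-entropy loss."""
    for _ in range(epochs):
        y = layer.forward(x)
        layer.backward(y - target, lr)  # dCE/dz for a sigmoid output

def predict(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x

def fine_tune(layers, x, target, lr=0.1, epochs=200):
    """Stage 2: backpropagate the final-layer loss through all layers."""
    for _ in range(epochs):
        y = predict(layers, x)
        grad_z = y - target
        for i in range(len(layers) - 1, -1, -1):
            grad_x = layers[i].backward(grad_z, lr)
            if i > 0:
                prev_y = layers[i - 1].y
                grad_z = grad_x * prev_y * (1.0 - prev_y)  # sigmoid derivative

# Hypothetical synthetic data standing in for annotated dialogue corpora.
A = rng.normal(size=(6, 2))

def make_data(n):
    x = rng.normal(size=(n, 6))                 # user dialogue behaviors
    u = (x @ A > 0).astype(float)               # user impression, user interest
    s = (u.sum(axis=1, keepdims=True) >= 1).astype(float)  # system impression
    units = np.hstack([s, 1.0 - s, s])          # unit-selection targets
    return x, u, s, units

X_small, u_small, s_small, units_small = make_data(40)   # with state labels
X_large, _, _, units_large = make_data(400)              # behaviors only

l1, l2, l3 = LogisticLayer(6, 2), LogisticLayer(2, 1), LogisticLayer(1, 3)
pretrain(l1, X_small, u_small)        # behaviors -> user states
pretrain(l2, u_small, s_small)        # user states -> system impression
pretrain(l3, s_small, units_small)    # system impression -> unit selection

layers = [l1, l2, l3]
fine_tune(layers, X_large, units_large)

pred = predict(layers, X_large)
acc = float(((pred > 0.5) == (units_large > 0.5)).mean())
print("unit-selection accuracy on the synthetic data:", acc)
```

In this sketch the pre-training stage gives the hidden units their intended meaning (user states and system impression), and fine-tuning then adjusts all weights using only the final unit-selection targets, which mirrors the ordering described in the abstract under the stated assumptions.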
Acknowledgements
This work was supported by JST ERATO Grant Number JPMJER1401, Japan. The authors would like to thank Professor Graham Wilcock for his insightful advice.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Tanaka, K., Inoue, K., Nakamura, S., Takanashi, K., Kawahara, T. (2021). End-to-end Modeling for Selection of Utterance Constructional Units via System Internal States. In: Marchi, E., Siniscalchi, S.M., Cumani, S., Salerno, V.M., Li, H. (eds) Increasing Naturalness and Flexibility in Spoken Dialogue Interaction. Lecture Notes in Electrical Engineering, vol 714. Springer, Singapore. https://doi.org/10.1007/978-981-15-9323-9_2
DOI: https://doi.org/10.1007/978-981-15-9323-9_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9322-2
Online ISBN: 978-981-15-9323-9
eBook Packages: Engineering, Engineering (R0)