Abstract
Designing a question-answer database is important for making natural conversation with an example-based dialog system. We focused on a method that collects example sentences from actual conversations with the system. In this study, examples were collected from conversation logs, and we investigated the relationship between the response appropriateness and the number of interactions. In the experiment, transcriptions of the users’ utterances were added to the database at the end of every interaction, and the response sentences in the database were created manually. The results showed that the response appropriateness improved as the number of interactions increased and saturated at around 70%. In addition, we compared the collected database with a fully hand-crafted database by subjective evaluation. The scores for user satisfaction, dialog engagement, intelligence, and willingness to use were higher than those of the hand-crafted database, suggesting that the proposed method obtains examples better suited to actual conversation from a subjective point of view.
1 Introduction
Recently, many studies on non-task-oriented spoken dialog systems have been carried out [1,2,3]. Most non-task-oriented dialog systems are example-based, and developing a large-scale example-response database is necessary for such systems to hold a natural dialog. There are many methods for constructing the database, such as manual development [4] or collection from web resources [5, 6]. However, hand-crafted examples do not always coincide with users’ actual utterances, and automatic collection methods sometimes gather inappropriate responses or unusable examples.
Collecting examples from actual conversations with the system is one promising approach. Several works have employed it (e.g., [7]); however, they lack careful analyses, such as how many examples can be collected by iterating the interactions, or how the characteristics of the resulting database differ from one developed by the conventional approach.
In this study, we focus on example collection through conversation. We investigate the relationship between the response appropriateness and the number of interactions, and we compare the performance of the collected database with a fully hand-crafted database by subjective evaluation.
2 Examples Collection by Conversation with Spoken Dialog System
We prepared initial databases and started the example collection. The transcriptions of the participants’ utterances and the corresponding response sentences were added to the database at the end of every dialog.
2.1 Initial Example-Response Database
The initial databases were composed topic by topic and included example-response pairs corresponding to greetings, backchannels, and task-specific interactions. We assumed four dialog topics: cooking, movie, meal, and shopping. We also assumed a chat between friends as the conversation style (i.e., the participant and the dialog agent are friends). In the dialog, the participants were instructed to ask the dialog agent what she had done yesterday, on the assumption that she (the dialog agent) led a human-like life. The initial database was constructed from questions about the daily events she was supposed to have experienced. Table 1 summarizes the number of pairs in each database.
2.2 Procedure of Example Collection
Table 2 shows the flow of the method for collecting example sentences for topic t. Let I be the total number of interactions, \( D_{i}^{t} \) be the i-th database of topic t, and \( D_{CSJ} \) be a document set of the Corpus of Spontaneous Japanese (CSJ). Each participant talked with the system only once, so the iteration index corresponds to the speaker index.
Fully automatic example collection would require detecting utterances not covered by the current database \( E_{i} \) and generating responses for them. However, since the focus of this paper is the relationship between the response appropriateness and the number of iterations, both steps were conducted manually: the utterances were transcribed and then added to the database.
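The per-topic procedure of Table 2 can be sketched as the following loop. This is a minimal sketch: the callables (`transcribe`, `write_response`, `retrain_lm`) are hypothetical stand-ins for the steps that were performed manually in this study.

```python
def collect_examples(initial_db, participants, transcribe, write_response, retrain_lm):
    """Iterative example collection: after each dialog, add the transcribed
    user utterances (with manually written responses) to the database and
    retrain the language model. All callables are hypothetical stand-ins
    for the manual steps described in the paper."""
    db = list(initial_db)                            # D_0: initial example-response pairs
    for i, participant in enumerate(participants, start=1):
        utterances = transcribe(participant)         # user utterances of dialog i
        for u in utterances:
            if u not in (ex for ex, _ in db):        # out-of-database utterance
                db.append((u, write_response(u)))    # add a new pair -> D_i
        retrain_lm(db)                               # update the ASR language model
    return db
```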
3 Analysis of Response Appropriateness by Iterating Dialog
3.1 Experimental Conditions
The dialog experiments for the example collection were conducted in a sound-proof chamber. Twenty-five persons (15 males and 10 females) participated in the experiment. When a participant made an utterance, the system calculated the similarity between the speech recognition result of the utterance and the example sentences in the database, and selected the response corresponding to the most similar example as the system’s utterance. Cosine similarity was used for the similarity calculation. The system was implemented based on MMDAgent [8], an open-source toolkit for building speech interaction systems. The language model was trained on the sentences in the CSJ and the examples of the initial databases to accommodate task-specific utterances, and was re-trained at the end of every dialog using the collected examples.
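The response selection described above can be illustrated as follows. This is a simplified sketch assuming a bag-of-words representation of each utterance; the sample database entries are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_response(recognized_words, database):
    """Return the response paired with the example most similar to the
    speech recognition result. `database` is a list of
    (example_words, response) pairs."""
    best = max(database, key=lambda pair: cosine_similarity(recognized_words, pair[0]))
    return best[1]

# Hypothetical database entries for the "cooking" topic
db = [(["what", "did", "you", "cook"], "I made curry yesterday."),
      (["hello"], "Hi, nice to see you!")]
print(select_response(["what", "did", "you", "cook", "yesterday"], db))
# → I made curry yesterday.
```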
The experiments were separated into two sections. In the first section, the participants made ten input utterances to investigate the appropriateness of each response from the database \( D_{i - 1} \), evaluating each response as “appropriate” or “not appropriate.” In the second section, they engaged in a three-minute dialog for the example collection without the appropriateness evaluation, asking the agent what she had done yesterday. The dialog was user-initiated: the participants asked the agent questions and the agent responded to them.
3.2 Measurement of Appropriateness
We defined the coverage as an index of the appropriateness of the system’s responses. The coverage \( C_{i}^{t} \) of topic t for the i-th participant (equivalently, the i-th dialog) is calculated as follows:

\( C_{i}^{t} = R_{i}^{t} / N_{i}^{t} \)  (1)

where \( R_{i}^{t} \) is the number of responses evaluated as appropriate and \( N_{i}^{t} \) is the number of interchanges (\( N_{i}^{t} = 10 \)). In the following sections, we investigate the appropriateness of the responses based on this coverage.
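The coverage of Eq. (1) is simply the fraction of responses judged appropriate in one dialog; a minimal sketch:

```python
def coverage(judgments):
    """Coverage C_i^t: fraction of responses judged appropriate.
    `judgments` is one boolean per interchange (here N_i^t = 10)."""
    return sum(judgments) / len(judgments)

# e.g. 7 of 10 responses judged appropriate
print(coverage([True] * 7 + [False] * 3))  # → 0.7
```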
3.3 Experimental Results
Figure 1 shows the trend of the coverage over the number of interactions. The horizontal axis is the average coverage over blocks of five interactions. The result for the speech recognition output is denoted as RECOG and that for the transcription as TRANS. The blue line shows the trend of RECOG: the coverage improves until the 11th–15th interaction and remains flat after that. Because RECOG contains recognition errors, the appropriateness of the responses was also analyzed based on the manual transcriptions. The responses for the transcriptions were selected based on Eq. (1), and their appropriateness was judged by a majority vote of three evaluators (1 male, 2 female). The red line shows the trend of TRANS. Its tendency is similar to that of RECOG, and the coverage saturates at around 75% at the 11th–15th interaction. More interactions would be required to improve the appropriateness further, but the efficiency of the example collection appears to decrease because the remaining uncovered examples belong to deeper interchanges of the dialog.
4 Comparison of the Collected Databases with Hand-Crafted Databases by Dialog Experiments
The collected databases were compared with the hand-crafted databases to investigate the efficacy of the example collection by the conversation. In this section, we denote the collected database as DIALOG and hand-crafted one as HANDCRAFTED.
4.1 Experimental Condition
The experiments were conducted using DIALOG and HANDCRAFTED under the same conditions as the previous section, except for the three-minute collection dialogs. Ten subjects (8 males, 2 females) participated in the experiment. The DIALOG databases, collected over 25 interactions, ranged from 550 to 600 sentences. The HANDCRAFTED databases were created by 10 persons (6 males, 4 females), each of whom wrote around 50 to 60 examples. The database creators were provided with the initial database and wrote example sentences while imagining possible interactions. For consistency, the response sentences were developed by one person (the first author); if the same example was included in DIALOG, we assigned the same response.
We constructed eight systems, preparing a DIALOG and a HANDCRAFTED database for each topic. The order of the topics presented to the participants was fixed, and the HANDCRAFTED systems were presented first for two topics randomly selected out of the four. The participants made ten utterances in the interaction with each system and evaluated the appropriateness of each response. After the experiments, they answered a questionnaire for the subjective evaluation, rating the following four items on a five-grade Likert scale from one (not at all) to five (very much).
- Satisfaction: whether the user was satisfied with the dialog with the system.
- Engagement: whether the user felt that the dialog was engaging.
- Intelligence: whether the user felt that the system was intelligent.
- Willingness: whether the user wanted to use the system again.
4.2 Experimental Result
Table 3 shows the coverage averaged over the participants. The coverage of the speech recognition results is denoted as RECOG and that of the transcriptions as TRANS. The appropriateness for TRANS was evaluated by a majority vote of three annotators (1 male, 2 female). As shown in the table, the coverage of the DIALOG database was higher than that of the HANDCRAFTED database in both the RECOG and TRANS cases, reaching around 70% for DIALOG. These results indicate that examples collected through conversation are better suited to actual use than hand-crafted ones.
The results of the subjective evaluation are summarized in Fig. 2, where the error bars show the standard error. The DIALOG database outperformed the HANDCRAFTED database on all items. Using an unpaired t-test, we obtained significant differences in satisfaction (\( N = 40 \), \( t = - 2.85 \), \( p = 0.006 \)), engagement (\( N = 40 \), \( t = - 3.42 \), \( p \le 0.001 \)), intelligence (\( N = 40 \), \( t = - 3.34 \), \( p = 0.001 \)), and willingness (\( N = 40 \), \( t = - 2.18 \), \( p = 0.016 \)). Therefore, example collection through conversation can construct a superior database in terms of not only the coverage but also the subjective evaluation. In particular, engagement showed the largest difference among the evaluation items, suggesting that examples for deeper interactions are especially important in chat-style conversation. Databases of conventional systems are often constructed by developers imagining the actual dialog, but it is not easy to cover all possible conversational flows. A framework that collects examples through interaction is therefore important for future dialog systems to construct an appropriate database at low cost.
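The significance tests above use an unpaired (two-sample) t-test over the per-participant Likert scores. As a sketch of how such a statistic is computed (the score lists below are hypothetical, not the paper’s data; obtaining a p-value would additionally require the t-distribution CDF, e.g. via `scipy.stats`):

```python
import math
from statistics import mean, variance

def unpaired_t(a, b):
    """Student's two-sample t statistic with pooled variance
    (the standard equal-variance unpaired t-test)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical five-grade Likert scores for one evaluation item
handcrafted = [3, 2, 4, 3, 2, 3, 3, 2, 4, 3]
dialog      = [4, 4, 5, 4, 3, 4, 5, 4, 4, 5]
print(f"t = {unpaired_t(handcrafted, dialog):.2f}")
```

A negative t here means the second group (DIALOG) scored higher, matching the sign convention of the values reported above.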
5 Conclusion
In this research, we examined an example collection method based on conversation for an example-based dialog system and showed its efficacy through several analyses. We found that the coverage of the examples saturated at 75% after iterating the interaction 15 times. We then compared the database collected through conversation with a fully hand-crafted database: the examined approach outperformed the hand-crafted method in satisfaction, engagement, intelligence, and willingness. In particular, the difference in the engagement scores was larger than the others, indicating that example collection through conversation obtains examples better suited to actual conversation than the conventional approach.
In future work, we will investigate the characteristics of the collected examples to clarify how they differ from hand-crafted ones. In addition, we will examine methods to detect out-of-database utterances and generate responses automatically.
References
Bickmore, T.W., Picard, R.W.: Establishing and maintaining long-term human-computer relationships. ACM Trans. Comput.-Hum. Inter. 12(2), 293–327 (2005)
Meguro, T., Higashinaka, R., Minami, Y., Dohsaka, K.: Controlling listening-oriented dialogue using partially observable Markov decision processes. In: Proceedings of 23rd International Conference on Computational Linguistics, pp. 761–769 (2010)
Higashinaka, R., et al.: Towards an open-domain conversational system fully based on natural language processing. In: Proceedings of COLING, pp. 928–939 (2014)
Sugiyama, H., Meguro, T., Higashinaka, R., Minami, Y.: Large-scale collection and analysis of personal question-answer pairs for conversational agents. In: Proceedings of International Conference on Intelligent Virtual Agents, pp. 420–433 (2014)
Ritter, A., Cherry, C., Dolan, B.: Unsupervised modeling of Twitter conversations. In: Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 172–180. Association for Computational Linguistics (2010)
Bessho, F., Harada, T., Kuniyoshi, Y.: Dialog system using real-time crowdsourcing and Twitter large-scale corpus. In: Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 227–231 (2012)
Traum, D., et al.: Evaluating spoken dialogue processing for time-offset interaction. In: Proceedings of SIGDIAL, pp. 199–208 (2015)
Lee, A., Oura, K., Tokuda, K.: MMDAgent-a fully open-source toolkit for voice interaction systems. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8382–8385 (2013)
© 2017 Springer International Publishing AG
Kageyama, Y., Chiba, Y., Nose, T., Ito, A. (2017). Collection of Example Sentences for Non-task-Oriented Dialog Using a Spoken Dialog System and Comparison with Hand-Crafted DB. In: Stephanidis, C. (eds) HCI International 2017 – Posters' Extended Abstracts. HCI 2017. Communications in Computer and Information Science, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-319-58750-9_63