
1 Introduction

Recently, many research works on non-task-oriented spoken dialog systems have been carried out actively [1,2,3]. Most non-task-oriented dialog systems employ an example-based approach, and developing a large-scale example-response database is necessary for such systems to produce natural dialogs. There are several methods for constructing the database, such as manual development [4] or collection from web resources [5, 6]. However, hand-crafted examples do not always coincide with users' actual utterances, and the automatic collection methods sometimes gather inappropriate responses or unusable examples.

Collecting the examples through actual conversations with the system appears to be one of the promising approaches. Several works have employed this approach (e.g., [7]); however, they lack detailed analyses, such as how many examples can be collected by iterating the interactions or how the characteristics of the resulting database differ from those of a database developed by the conventional approach.

In this study, we focus on example collection through conversation. We investigate the relationship between the response accuracy and the number of interactions, and compare the performance of the collected database with a fully hand-crafted database through a subjective evaluation.

2 Example Collection by Conversation with a Spoken Dialog System

We prepared initial databases and started the example collection. The transcriptions of the participants' utterances and the corresponding response sentences were added to the database at the end of every dialog.

2.1 Initial Example-Response Database

The initial databases were composed topic by topic and include example-response pairs corresponding to greetings, backchannels, and task-specific interactions. We assumed four dialog topics: cooking, movie, meal, and shopping. We also assumed a chat between friends as the conversation style (i.e., the participant and the dialog agent are friends). In the dialog, the participants were instructed to ask the dialog agent what she had done yesterday, on the assumption that she (the dialog agent) had led a human-like life. The initial database was therefore constructed with questions about the daily events she was supposed to have done. Table 1 summarizes the number of pairs in each database.

Table 1. Number of example-response pairs for each topic
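As a concrete illustration (not taken from the actual databases, which were in Japanese), such a topic-wise database can be thought of as a set of example-response pairs per topic:

```python
# Illustrative layout of a topic-wise example-response database.
# The sentences are invented for illustration; the actual databases
# contained greetings, backchannels, and task-specific pairs per topic.
initial_databases = {
    "cooking":  [("Hello!", "Hi, nice to see you."),
                 ("What did you cook yesterday?", "I made curry and rice.")],
    "movie":    [("Did you watch a movie yesterday?", "Yes, a comedy.")],
    "meal":     [("What did you eat yesterday?", "I had ramen for lunch.")],
    "shopping": [("Did you go shopping yesterday?", "Yes, I bought a new bag.")],
}
```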

2.2 Procedure of Example Collection

Table 2 shows the flow of the method for collecting example sentences for topic t. Let I be the total number of interactions, \( D_{i}^{t} \) be the i-th database of topic t, and \( D_{CSJ} \) be a document set of the Corpus of Spontaneous Japanese (CSJ). Each participant talks with the system only once, so the iteration index corresponds to the speaker index.

Table 2. Procedure of example collection for topic t

Fully automatic example collection requires detecting the user utterances that fall outside the current example set \( E_{i} \) and generating their responses. However, since the focus of this paper is investigating how the response appropriateness changes with the number of iterations, both steps were conducted manually. The collected utterances were transcribed and then added to the database.
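For illustration, the collection loop of Table 2 can be sketched as below. The helper callables are hypothetical stand-ins for the steps that were performed manually in this study (detecting uncovered utterances and writing their responses); only the overall flow follows the description above.

```python
# A minimal sketch of the per-topic collection loop outlined in Table 2.
# The callables passed in stand for the manual steps described in the text.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (example utterance, system response)

def collect_examples(
    initial_db: List[Pair],                                        # D_0^t
    num_interactions: int,                                         # I
    run_dialog: Callable[[List[Pair], int], List[str]],            # i-th dialog -> user utterances
    find_uncovered: Callable[[List[str], List[Pair]], List[str]],  # utterances not covered by the database
    write_response: Callable[[str], str],                          # manual response authoring
) -> List[Pair]:
    """Iteratively grow the example-response database D_i^t for one topic."""
    db = list(initial_db)
    for i in range(1, num_interactions + 1):                       # each participant talks only once
        utterances = run_dialog(db, i)
        for utt in find_uncovered(utterances, db):
            db.append((utt, write_response(utt)))                  # D_i^t = D_{i-1}^t plus the new pairs
    return db
```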

3 Analysis of Response Appropriateness by Iterating Dialog

3.1 Experimental Conditions

The dialog experiments for the example collection were conducted in a sound-proof chamber. Twenty-five persons (15 males and 10 females) participated in the experiment. When a participant made an utterance, the system calculated the similarity between the speech recognition result of the utterance and the example sentences in the database, and selected the response corresponding to the most similar example as the system's utterance. The cosine similarity was used for the similarity calculation. The system was implemented based on MMDAgent [8], an open-source toolkit for building speech interaction systems. The language model was trained on the sentences of the CSJ and the examples of the initial databases to accommodate task-specific utterances, and it was re-trained at the end of every dialog using the collected examples.
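As a rough sketch of this example-based response selection (a simplified reconstruction, not the actual MMDAgent implementation), the recognized utterance can be compared with every example by cosine similarity; the bag-of-words representation below is an assumption, since the paper does not specify how the sentences were vectorized.

```python
# Select the response whose example sentence is most similar to the
# speech recognition result, using cosine similarity over bag-of-words
# vectors (the vectorization scheme is an assumption of this sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_response(asr_result, example_response_pairs):
    examples = [ex for ex, _ in example_response_pairs]
    vectorizer = CountVectorizer().fit(examples + [asr_result])
    sims = cosine_similarity(vectorizer.transform([asr_result]),
                             vectorizer.transform(examples))[0]
    best = sims.argmax()                       # index of the most similar example
    return example_response_pairs[best][1]     # its paired response
```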

The experiments were separated into two sections. In the first section, the participants made 10 input utterances to investigate the appropriateness of each response of the database \( D_{i - 1} \), and evaluated each response as "appropriate" or "not appropriate". In the second section, they engaged in a three-minute dialog for the example collection without the appropriateness evaluation. The participants asked the agent what she had done yesterday. The dialog was user-initiative: the participants asked questions and the agent responded to them.

3.2 Measurement of Appropriateness

We defined the coverage as an index of the appropriateness of the system's responses. The coverage \( C_{i}^{t} \) of topic t for the i-th participant (i.e., the i-th dialog) is calculated as follows:

$$ C_{i}^{t} = \frac{R_{i}^{t}}{N_{i}^{t}} $$
(1)

Here, \( R_{i}^{t} \) is the number of responses evaluated as appropriate and \( N_{i}^{t} \) is the number of interchanges (\( N_{i}^{t} = 10 \)). In the following sections, we investigate the appropriateness of the responses based on this coverage.
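For example, if a participant judged seven of the ten responses as appropriate, Eq. (1) gives a coverage of 0.7:

```python
# Coverage as in Eq. (1): the fraction of judged-appropriate responses
# among the N_i^t = 10 interchanges of one dialog.
def coverage(judgments):
    """judgments: one boolean per interchange (True = appropriate)."""
    return sum(judgments) / len(judgments)

print(coverage([True] * 7 + [False] * 3))   # 0.7
```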

3.3 Experimental Results

Figure 1 shows the trend of the coverage with respect to the number of interactions; each point shows the coverage averaged over five consecutive interactions. The coverage based on the speech recognition results is denoted as RECOG and that based on the manual transcriptions as TRANS. The blue line shows the trend of RECOG. As shown in the figure, the coverage improves until the 11th–15th interactions and remains flat after that. Because RECOG contains recognition errors, the appropriateness of the responses was also analyzed based on the manual transcriptions: the responses were re-selected using the transcriptions as input, their appropriateness was judged by a majority vote of three evaluators (one male, two females), and the coverage was computed with Eq. (1). The red line shows the trend of TRANS. Its tendency is similar to that of RECOG, and the coverage saturates at around 75% at the 11th–15th interactions. More interactions would be required to improve the appropriateness further, but the efficiency of the example collection appears to decrease because the remaining uncovered utterances correspond to interchanges of deeper interaction.

Fig. 1. Coverage of user's utterances with respect to the number of interactions (Color figure online)

4 Comparison of the Collected Databases with Hand-Crafted Databases through Dialog Experiments

The collected databases were compared with hand-crafted databases to investigate the efficacy of the example collection by conversation. In this section, we denote the collected databases as DIALOG and the hand-crafted ones as HANDCRAFTED.

4.1 Experimental Conditions

The experiments with DIALOG and HANDCRAFTED were conducted under the same conditions as in the previous section, except for the three-minute collection dialogs. Ten subjects (8 males and 2 females) participated in the experiment. Each DIALOG database, collected through 25 interactions, contains 550 to 600 sentences. The HANDCRAFTED databases were created by 10 persons (6 males and 4 females); each database creator made around 50 to 60 examples. The database creators were provided with the initial database and wrote example sentences while imagining possible interactions. The response sentences were written by one person (the first author) for consistency; if the same example was also included in DIALOG, we assigned the same response.

We constructed eight systems by preparing a DIALOG and a HANDCRAFTED database for each topic. The order of the topics presented to the participants was fixed, and the HANDCRAFTED systems were presented first for two topics randomly selected out of the four. The participants made ten utterances in the interaction with each system and evaluated the appropriateness of each response at the end of every interaction. After the experiments, they answered a questionnaire for the subjective evaluation, rating the following four items on a five-grade Likert scale from one (not at all) to five (very much).

  • Satisfaction: whether the user was satisfied with the dialog with the system.

  • Engagement: whether the user felt engaged in the dialog.

  • Intelligence: whether the user felt that the system was intelligent.

  • Willingness: whether the user wanted to use the system again.

4.2 Experimental Results

Table 3 shows the coverage averaged over the participants. As before, the coverage based on the speech recognition results is denoted as RECOG and that based on the transcriptions as TRANS, and the appropriateness for TRANS was evaluated by a majority vote of three annotators (one male, two females). As shown in the table, the coverage of the DIALOG databases was higher than that of the HANDCRAFTED databases for both RECOG and TRANS, with DIALOG reaching around 70%. These results indicate that the examples collected by conversation are better suited to actual use than hand-crafted ones.

Table 3. Coverage of databases

The results of the subjective evaluation are summarized in Fig. 2, where the error bars show the standard errors. The DIALOG databases outperformed the HANDCRAFTED databases in all of the items. Using the unpaired t-test, we obtained significant differences in satisfaction (\( N = 40 \), \( t = - 2.85 \), \( p = 0.006 \)), engagement (\( N = 40 \), \( t = - 3.42 \), \( p \le 0.001 \)), intelligence (\( N = 40 \), \( t = - 3.34 \), \( p = 0.001 \)), and willingness (\( N = 40 \), \( t = - 2.18 \), \( p = 0.016 \)). Therefore, the example collection by conversation can construct a superior database in terms of not only the coverage but also the subjective evaluation. In particular, Engagement showed the largest difference among the evaluation items, which suggests that examples for deeper interaction are especially important in chat-style conversation. Many databases of conventional systems are constructed by developers while imagining the actual dialogs, but it is not easy to cover all possible flows of conversation in this way. Therefore, a framework that collects the examples through interaction is important for future dialog systems to construct an appropriate database at low cost.

Fig. 2. Subjective evaluation results for the systems
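For reference, the comparison above corresponds to an unpaired (two-sample) t-test per evaluation item, as sketched below; the score lists are placeholders rather than the actual ratings, and we assume 20 Likert ratings per database type (N = 40 in total).

```python
# An unpaired t-test between HANDCRAFTED and DIALOG Likert scores of one
# item (placeholder data; 20 assumed ratings per condition, N = 40).
from scipy import stats

handcrafted = [3, 2, 4, 3, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 3, 2, 3, 3, 2, 3]
dialog      = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4, 4]

t, p = stats.ttest_ind(handcrafted, dialog)   # negative t: DIALOG scored higher
print(f"t = {t:.2f}, p = {p:.3f}")
```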

5 Conclusion

In this research, we examined a method of collecting examples through conversation for an example-based dialog system and showed its efficacy through several analyses. We found that the coverage of the examples saturated at about 75% after iterating the interaction 15 times. We then compared the database collected by conversation with a fully hand-crafted database. The examined approach outperformed the hand-crafted one in Satisfaction, Engagement, Intelligence, and Willingness. In particular, the difference in the Engagement scores was larger than that of the other items, which indicates that example collection by conversation obtains examples better suited to actual conversation than the conventional approach.

In future work, we will investigate the characteristics of the collected examples to clarify the differences from the hand-crafted examples. In addition, we will examine methods to detect out-of-database examples and generate their responses automatically.