1 Introduction

Conversational information retrieval systems, which allow users to satisfy a variety of information needs naturally and efficiently, have attracted increasing attention. Such a system usually contains an open-domain conversation module that generates responses, a hot research topic in natural language processing.

Modern open-domain conversation systems often adopt data-driven approaches, owing to the availability of large amounts of conversation data and recent progress in neural methods.

Basically, there are two main categories of approaches to building an open-domain conversation system: retrieval-based methods and generation-based methods. Retrieval-based systems maintain a large repository of conversation data and search for the most reasonable response using information retrieval approaches (Ji et al. 2014; Li et al. 2016b; Yan et al. 2016; Bartl and Spanakis 2017). A clear advantage of retrieval-based approaches is that the returned responses are usually fluent and grammatically correct, since they are selected from a repository of real human dialogues. However, as retrieval-based systems do not generate new responses but only select one from a repository, the repository must cover a wide range of conversations. This is difficult to guarantee in practice, as conversation topics vary greatly and conversation repositories are usually limited samples of real-world conversations.

Fig. 1 Sample input messages and corresponding responses from the Weibo dataset. The original text is in Chinese and is translated into English here. Similar conversations are retrieved by our retrieval module from the training data. Words in bold appear in both the input messages and the retrieved results, while underlined words appear in both the gold response and the retrieved results

On the other hand, generation-based systems try to generate a response rather than retrieve an existing one. Variants of sequence-to-sequence (Seq2seq) neural network models (Sutskever et al. 2014; Shang et al. 2015; Li et al. 2016a; Mou et al. 2016; Xing et al. 2017; Tian et al. 2017; Wu et al. 2018b) have been successfully applied to building conversation systems. These models typically incorporate an encoder and a decoder: the encoder represents a message as a vector, and the decoder generates a reply based on it. An attention mechanism is often used to help the model learn patterns from the data (Bahdanau et al. 2014; Luong et al. 2015).

The Seq2seq model is able to generate new replies for new messages. However, it is often observed that the Seq2seq model is liable to generate short, trivial and meaningless replies such as “something” and “I don’t know” (Li et al. 2016a). This problem is believed to stem from insufficient source information for generating meaningful targets (Tian et al. 2017). Even when a large amount of data is provided, only message-response pairs are used, and all parameters are learned from them alone. In the absence of more information, trivial replies are often “safer” solutions. It is believed that this problem can be alleviated by introducing additional information into the generation process (Xing et al. 2017; Mou et al. 2016). Our work is also an attempt in this direction.

In previous studies, additional information provided by a pre-trained external model, such as a commonsense knowledge graph, a topic model or an emotion classifier, has been shown to help generate more informative responses (Ghazvininejad et al. 2018; Zhou et al. 2018b; Xing et al. 2017; Zhou et al. 2018a). However, such external knowledge is not always available in real applications, and the quality of the external models also influences the generation results. In this work, we propose a framework, called ReBoost, that uses retrieved results as additional inputs to the Seq2seq model to boost generation. These retrieved results are returned by an information retrieval (IR) system built on the training data, thus avoiding any external knowledge. Let us use some examples to explain what retrieved results are and to motivate our idea. As shown in Fig. 1, in a Weibo dataset, there are many similar dialogues (message-response pairs). These pairs are retrieved by an IR system using the input message as the query. The IR system ranks the results based on the matching degree between the input message and each message in the repository. Thanks to these similar messages, the responses in the retrieved results can provide some of the information contained in the gold responses that should be generated. We refer to these retrieved responses as retrieved results. As can be seen in the first example, the gold response and the retrieved results share some words such as “sleep”. If we feed this retrieved result into the response generation process, the model is more likely to generate a response related to “sleep”. Therefore, we hypothesize that retrieved results can provide useful prior knowledge for generating responses.

Fig. 2 The overview of ReBoost. All data are marked with dashed lines. The remaining parts are four modules, i.e., a retrieval module, a message encoder, a retrieval information encoder and a keyword-aware decoder

The overview of ReBoost is illustrated in Fig. 2. Specifically, given an input message, the retrieval module returns some relevant responses and their relevance scores. Our assumption is that the information contained in these retrieved results can help generate better responses. As different words and different retrieved responses may play different roles in the generation process, we construct a hierarchical structure from word level to sentence level (each response contains only one sentence in our dataset). We design a gated hierarchical attention mechanism to integrate words, sentences and their relevance scores into the generation process. A word-level attention assigns different weights to words in the retrieved results according to their importance in generation; keywords that carry useful information are expected to receive higher weights in this step. Then, each retrieved response is represented as a vector by the weighted sum of its word embeddings and fed into a sentence-level attention. Similarly, at this level, each retrieved result is assigned a weight based on its contribution to the generation process. Furthermore, to leverage the relevance information returned by the retrieval model, we design a gate operation that uses the relevance scores as prior knowledge when assigning weights to the retrieved replies. The weighted sum of the sentence vectors then forms a supplementary vector that represents the retrieval information. In addition, to enhance the decoder, we extract some keywords from the retrieved results to guide the generation process explicitly.

We conduct an empirical study on two large-scale datasets. The first is the Sina Weibo dataset released by the NTCIR-13 STC task (Shang et al. 2017), a Chinese dataset constructed from users’ posts and the corresponding replies on Sina Weibo. The other is the OpenSubtitles dataset proposed by Li et al. (2016a), an English dataset containing scripted lines spoken by movie characters, extracted from OpenSubtitles. We compare our ReBoost model with existing methods in both automatic and human evaluations, and analyze the effectiveness of the different modules in our model through a module ablation experiment. Experimental results show that ReBoost generates more informative and meaningful responses than the state-of-the-art models. This confirms our assumption that utilizing retrieved results from the training data is helpful in the generation process.

Our contributions are summarized as follows: (1) we present a retrieved-results-aware neural response generation model, which uses retrieved results as supplementary information to aid generation; (2) we design a novel gated attention mechanism that uses relevance scores as a form of prior knowledge to improve the learning process; (3) we conduct experiments on two widely used datasets and verify our assumption that retrieved results are helpful for generating better responses.

The rest of the paper is structured as follows: Sect. 2 briefly reviews recent work in neural response generation. The details of our model are described in Sect. 3. Section 4 describes the experiments and results, along with analysis and discussion. Finally, we conclude the paper in Sect. 5.

2 Related work

In this section, we briefly introduce recent related work and compare it with our model. These studies fall into two groups: retrieval-based systems and generation-based systems.

2.1 Retrieval-based system

Retrieval-based methods take the input message as a query and select a set of suitable responses from a large conversation repository using information retrieval (IR) techniques (Ji et al. 2014). In addition to basic information retrieval approaches, various additional features and deep networks have been used to rank and select replies. Some works focus on learning to rank responses according to their similarity with a given message (Wu et al. 2018a; Bartl and Spanakis 2017; Yan et al. 2016). At the other end of the spectrum, retrieved results from a basic IR system are further reranked by a deep learning based model (Li et al. 2016b).

2.2 Generation-based system

Generation-based methods, and in particular Seq2seq models, have recently attracted increasing attention (Sutskever et al. 2014). Initial works applied the Seq2seq model to response generation, and the results demonstrated its effectiveness (Shang et al. 2015). However, many researchers have reported that the Seq2seq model is liable to generate short, trivial and meaningless replies (Li et al. 2016a; Xing et al. 2017; Tian et al. 2017). To tackle this problem, Li et al. proposed modifying the objective function during training, i.e., using mutual information instead of maximum likelihood (Li et al. 2016a). In this case, the parameters of the Seq2seq model are still learned only from message-response pairs; with such limited input information, the Seq2seq model cannot generate substantially more informative responses (Tian et al. 2017).

Fig. 3 Comparison between our method and existing methods. The top part (1) represents existing methods, which use external data as supplementary information to improve response generation, while the bottom part (2) is our method, which uses internal data to boost the generation process

In order to incorporate more information into the generation process, many researchers have proposed using external knowledge and models. For example, Xing et al. used a topic model to extract topic information and guide the generation process (Xing et al. 2017). This model can generate more informative results with the help of topic information. However, it has two drawbacks. First, training a usable topic model requires a large-scale text corpus, and such an external dataset is not always available. In early experiments, we trained a similar topic model on the conversation dataset (about 4 million pairs), but the results were highly unreasonable. Second, given the limited number of topics, it is possible that no topic is specific enough to an input message, in which case the approach is less useful. Compared with this model, our method uses conversations in the training set that are related to the input message as supplementary material, which can provide more specific information (such as keywords, concepts, etc.) for response generation.

Many studies focus on augmenting response generation models with other external information such as commonsense knowledge and emotion classes. Commonsense knowledge is vital to many natural language processing tasks and can, in theory, also be helpful in dialogue systems (Ghazvininejad et al. 2018; Zhou et al. 2018b). Unfortunately, an open-domain commonsense knowledge graph is hard to obtain. In a recent work (Zhou et al. 2018b), only about 20,000 entities and their relations are used as commonsense information, which is small compared with the number of conversational pairs (3 million) in their experiments. That is to say, only a small fraction of conversations can be augmented with commonsense information, so the improvement is limited. Building an emotional conversation system is another interesting problem: a response can be more meaningful if the corresponding emotion is known. Zhou et al. (2018a) proposed a chatting machine with such emotion information. All conversation pairs are categorized into six emotion groups, and the classification accuracy is reported to be 64%. The generation results depend directly on the emotion class; if an inaccurate emotion is given, the generation process is affected.

In summary, as shown in Fig. 3, all the aforementioned models improve the Seq2seq model by incorporating external data through external models, which also introduces noise. Moreover, the external data is not always available in real application scenarios. Compared with these studies that rely on external knowledge, our method draws helpful information from the training set rather than from an outside dataset. This is more applicable in real scenarios and avoids the noise in external data. Moreover, by using retrieved results to boost generation, our method moves a step further towards building an ensemble system combining both retrieval-based and generation-based methods.

3 The ReBoost model

To incorporate the information contained in retrieved results into the generation process, we propose the ReBoost model. As illustrated in Fig. 2, our model consists of a retrieval module, a message encoder, a retrieval information encoder and a keyword-aware decoder. As introduced in Sect. 1, given an input message, our idea is to generate a response with the help of retrieved results from the training data. We first retrieve \(n_s\) message-response pairs and their relevance scores for the input message using the retrieval module. The input message and the retrieved responses are represented as fixed-size vectors by the message encoder and the retrieval information encoder respectively; we call these the message vector and the retrieval information vector. In particular, when computing the retrieval information vector, we also take into account the relevance scores provided by the retrieval module, which prove to be helpful prior knowledge in the learning process. For decoding, the two vectors provided by the encoders jointly guide the generation process. To convey the key information more directly, we also extract some keywords from the retrieved results and explicitly increase their generation probabilities. The details are introduced below.

Fig. 4 Retrieval module

3.1 Retrieval module

Our motivation is to use retrieved results from the training data to improve the generation process, so the first problem is how to obtain these results. We build a retrieval module for this purpose (as shown in Fig. 4). In particular, we use Apache Solr, an open-source search platform, for the retrieval implementation. We construct indices on the message-response pairs in the training data, with the message and the response indexed as separate fields to allow directed queries.

Given an input message, the retrieval module provides many pairs and scores them according to their matching degree. Here we retrieve \(n_s\) message-response pairs according to the relevance score [BM25 (Robertson and Zaragoza 2009)] between the input and the message in each pair. These retrieved pairs are denoted as (\(m_{k}\), \(r_{k}\)), \(1 \le k \le n_s\). In this work, we use \(r_k\) as the retrieved results. Information retrieval is a relatively mature technique, so more sophisticated systems can be substituted as the retrieval module.
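For illustration, such a lookup could be issued against the Solr index as in the following sketch. The endpoint URL, core name and field names (message, response) are assumptions for this example rather than details of our implementation.

```python
import requests

# Hypothetical endpoint for a Solr core holding the message-response pairs;
# adjust the URL and field names to the actual index schema.
SOLR_SELECT = "http://localhost:8983/solr/conversations/select"

def retrieve(message, n_s=10):
    """Return up to n_s (message, response, BM25 score) triples."""
    params = {
        "q": message,                    # the input message as the query
        "df": "message",                 # match against the indexed messages
        "fl": "message,response,score",  # also return the relevance score
        "rows": n_s,
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    return [(d["message"], d["response"], d["score"]) for d in docs]
```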

Fig. 5 Message encoder

3.2 Message encoder

The input message is represented by the message encoder (as illustrated in Fig. 5). We use a bi-directional RNN with GRU cells as the encoder to represent the input message.

Formally, assuming the input message of length m is \(X=(x_1, x_2, \ldots , x_m)\), ReBoost first uses an embedding layer to map each word x to a d-dimensional embedding \(\mathbf {x}\):

$$\begin{aligned} x \Rightarrow \mathbf {x}. \end{aligned}$$
(1)

Then the hidden states of the encoder are the corresponding representations, i.e., \((\mathbf {h}_1, \mathbf {h}_2, \ldots , \mathbf {h}_m)\), where \(\mathbf {h}_i\) is computed as follows:

$$\begin{aligned} \mathbf {h}_i&= [\overrightarrow{\mathbf {h}}_i; \overleftarrow{\mathbf {h}}_i], \end{aligned}$$
(2)
$$\begin{aligned} \overrightarrow{\mathbf {h}}_i&= \text {GRU}_1(\mathbf {x}_i, \overrightarrow{\mathbf {h}}_{i-1}), \end{aligned}$$
(3)
$$\begin{aligned} \overleftarrow{\mathbf {h}}_i&= \text {GRU}_2(\mathbf {x}_i, \overleftarrow{\mathbf {h}}_{i+1}), \end{aligned}$$
(4)

where [;] is the concatenation operation. \(\overrightarrow{\mathbf {h}}_i\) is the hidden state in the forward RNN, while \(\overleftarrow{\mathbf {h}}_i\) is the hidden state in the backward RNN. The initial hidden states \(\overrightarrow{\mathbf {h}}_0\) and \(\overleftarrow{\mathbf {h}}_{m+1}\) are randomly initialized. The operations in a GRU cell of the forward RNN are defined as follows:

$$\begin{aligned} \mathbf {z}&= \sigma (\mathbf {W}_{z}\mathbf {x}_{i}+\mathbf {U}_{z}\overrightarrow{\mathbf {h}}_{i-1}), \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {r}&= \sigma (\mathbf {W}_{r}\mathbf {x}_{i}+\mathbf {U}_{r}\overrightarrow{\mathbf {h}}_{i-1}), \end{aligned}$$
(6)
$$\begin{aligned} {\tilde{\mathbf {h}}}_i&= \tanh (\mathbf {W}_{h}\mathbf {x}_{i}+\mathbf {U}_{h}(\mathbf {r} \odot \overrightarrow{\mathbf {h}}_{i-1})), \end{aligned}$$
(7)
$$\begin{aligned} \text {GRU}_1(\mathbf {x}_i, \overrightarrow{\mathbf {h}}_{i-1})&= \mathbf {z} \odot \overrightarrow{\mathbf {h}}_{i-1}+(1-\mathbf {z}) \odot {\tilde{\mathbf {h}}}_i, \end{aligned}$$
(8)

where \(\odot\) denotes the element-wise product between vectors. \(\tanh (\cdot )\) and \(\sigma (\cdot )\) are the tanh and sigmoid functions. \(\mathbf {W}_h\), \(\mathbf {W}_z\), \(\mathbf {W}_r\), \(\mathbf {U}_h\), \(\mathbf {U}_z\) and \(\mathbf {U}_r\) are parameter matrices. The backward RNN is defined likewise, and its definition is omitted here. Note that the parameters of the two RNNs are not tied, but randomly initialized and trained separately. With the bi-directional RNN, the representation \(\mathbf {h}_i\) of the word \(x_i\) can accumulate information from its context.
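As a minimal sketch, the forward GRU cell of Eqs. (5)-(8) can be written directly in numpy; the parameter shapes and the absence of bias terms follow the equations above, while the names and initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_i, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One forward step: x_i is the d-dim word embedding, h_prev the
    previous forward hidden state; returns the new state of Eq. (8)."""
    z = sigmoid(W_z @ x_i + U_z @ h_prev)               # update gate, Eq. (5)
    r = sigmoid(W_r @ x_i + U_r @ h_prev)               # reset gate,  Eq. (6)
    h_tilde = np.tanh(W_h @ x_i + U_h @ (r * h_prev))   # candidate,   Eq. (7)
    return z * h_prev + (1.0 - z) * h_tilde             # interpolate, Eq. (8)
```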

An attention mechanism is used to summarize the input message representations into a fixed-size vector. For clarity, we call it the input message vector and denote it as \(\mathbf {a}_t^M\). It is calculated as:

$$\begin{aligned} \mathbf {a}_t^M&= \sum _{j=1}^{m}{\alpha _{tj}\mathbf {h}_j}, \end{aligned}$$
(9)
$$\begin{aligned} \alpha _{tj}&= \frac{\exp (e_{tj})}{\sum _{k=1}^{m}{\exp (e_{tk})}}, \end{aligned}$$
(10)
$$\begin{aligned} e_{tj}&= \tanh {(\mathbf {W}_{\alpha _1}[\mathbf {s}_{t-1};\mathbf {h}_j])}, \end{aligned}$$
(11)

where \(\mathbf {s}_{t-1}\) is the hidden state of the decoder at decoding time step \(t-1\), which will be introduced later.
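The computation of Eqs. (9)-(11) can be sketched as follows, under the assumption that \(\mathbf {W}_{\alpha _1}\) maps the concatenated vector to a scalar score.

```python
import numpy as np

def message_attention(H, s_prev, W_alpha1):
    """H: (m, 2h) encoder states; s_prev: decoder state s_{t-1};
    W_alpha1: assumed to map [s_{t-1}; h_j] to a scalar. Returns a_t^M."""
    # e_{tj} = tanh(W_{alpha1} [s_{t-1}; h_j])                     Eq. (11)
    e = np.array([np.tanh(W_alpha1 @ np.concatenate([s_prev, h_j]))
                  for h_j in H])
    alpha = np.exp(e) / np.exp(e).sum()                          # Eq. (10)
    return alpha @ H                                             # Eq. (9)
```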

Fig. 6 Retrieval information encoder (Color figure online)

3.3 Retrieval information encoder

From the retrieval module, we obtain a number of retrieved results and their relevance scores. The next question is how to incorporate them into the generation process. In real life, when facing a new message, people often compose replies around a few keywords, and the retrieved responses can be used to identify those keywords. If similar conversations have happened before, the replies can even be reused. Based on this observation, we utilize the retrieved results at different levels and propose a gated hierarchical attention mechanism.

A simple way to implement our idea is to feed the keywords or retrieved responses into the decoder directly. However, this simple model cannot distinguish between more and less important retrieved results during reply generation. Besides, each retrieved result is a natural language sentence consisting of multiple words, and these words contribute differently to generating a response. Retrieved results should therefore be modeled hierarchically, from word level to sentence level; the simple model cannot extract this hierarchical information. To address these issues, we design a gated hierarchical attention mechanism (as shown in Fig. 6). This attention mechanism comprises a word-level attention layer and a sentence-level attention layer, which assign different weights to the words in the retrieved results and to the retrieved results themselves according to their importance in generating a target response. In the sentence-level attention layer, we add a gate operation (the red rectangle in Fig. 6) to incorporate the relevance score provided by the retrieval module. The relevance score is used as prior knowledge to guide the calculation of the weight of each retrieved response.

Formally, assume \((r_1, r_2, \ldots , r_{n_s})\) are the responses provided by the retrieval module and \((sc_1, sc_2, \ldots , sc_{n_s})\) are their corresponding relevance scores. Similar to the input message, the kth response \(r_{k}=(w_{k,1}, w_{k,2}, \ldots , w_{k,n_k})\) is first mapped into d-dimensional embeddings and then represented as \((\mathbf {h}_{k,1}, \mathbf {h}_{k,2}, \ldots , \mathbf {h}_{k,n_k})\) by an RNN with GRU cells. At decoding time step t, the representation of \(r_{k}\) can be calculated using a traditional attention mechanism as follows:

$$\begin{aligned} \mathbf {r}_{k,t}&= \sum _{j=1}^{n_k}{\alpha _{k,t,j} \mathbf {h}_{k,j}}, \end{aligned}$$
(12)
$$\begin{aligned} \alpha _{k,t,j}&= \frac{\exp {(o_{k,t,j})}}{\sum _{l=1}^{n_k}{\exp {(o_{k,t,l})}}}, \end{aligned}$$
(13)
$$\begin{aligned} o_{k,t,j}&= \tanh {(\mathbf {W}_{\alpha _2}[\mathbf {s}_{t-1};\mathbf {h}_{k,j}])}, \end{aligned}$$
(14)

where \(o_{k,t,j}\) and \(\alpha _{k,t,j}\) are the original and normalized weights of the jth word in the kth retrieved result when generating the tth word of the target response. Note that the representation of the kth response \(r_{k}\) is not fixed but changes across decoding steps; we therefore add a subscript to distinguish it, e.g., \(\mathbf {r}_{k,t}\) for the representation at time step t. \(\{\mathbf {r}_{k,t}\}_{k=1}^{n_s}\) are then fed into the sentence-level attention layer, where each is assigned a weight \(\beta _{k,t}\) to form a context vector \(\mathbf {a}^R_t\):

$$\begin{aligned} \mathbf {a}_t^R&= \sum _{k=1}^{n_s}{\beta _{k,t}\mathbf {r}_{k,t}}, \end{aligned}$$
(15)
$$\begin{aligned} \beta _{k,t}&= \frac{\exp {(o'_{k,t})}}{\sum _{j=1}^{n_s}{\exp {(o'_{j,t})}}}, \end{aligned}$$
(16)
$$\begin{aligned} o'_{k,t}&= \tanh {(\mathbf {W}_\beta [\mathbf {s}_{t-1};\mathbf {r}_{k,t}])}, \end{aligned}$$
(17)

where \(\beta _{k,t}\) is the normalized attention weight of the kth retrieved result, which reflects its contribution (importance) in generating the tth word, and \(o'_{k,t}\) is the weight before normalization. These equations follow the traditional attention mechanism, but they are not suitable for our sentence-level attention. We therefore modify the calculation of \(\beta _{k,t}\) and \(o'_{k,t}\), as introduced below.

Fig. 7 The gate mechanism

We first modify \(\beta _{k,t}\). This normalized response weight is learned automatically. However, when returning the retrieved results, the retrieval module also provides relevance scores that measure their relevance to the given message. These relevance scores are clearly valuable prior knowledge for the attention mechanism when assigning a weight to each retrieved result, yet they are not always reliable. To account for this, we also use alternative attention weights learned by the model itself. As both signals (the given relevance scores and the learned weights) are useful, we design a gate operation to automatically control their importance during the generation process.

The details of this gate operation are shown in Fig. 7. Formally, when assigning a weight to a retrieved reply \(r_{k,t}\) at time step t, the normalized sentence weight \(\beta _{k,t}\) is calculated from the given relevance score \(sc_{k}\) and an original weight \(o'_{k,t}\) learned by the model:

$$\begin{aligned} \beta _{k,t} = z_{k,t} \cdot sc_{k} + (1-z_{k,t}) \cdot o'_{k,t}, \end{aligned}$$
(18)

where \(z_{k,t}\) is the refer gate that controls how much the overall weight refers to the relevance score. It is randomly initialized and tuned during training. A smaller \(z_{k,t}\) means the weight learned by the model is more suitable for the case.

In the traditional attention mechanism, the original weight \(o'_{k,t}\) is normalized into a probability distribution over a set of input vectors (as in Eq. 16), i.e., all retrieved replies are assigned positive values (probabilities) that sum to one. However, this is not suitable for our case because: (1) there may be more than one relevant reply, and all of them should be allowed high weights, so constraining their weights to sum to one is inappropriate; (2) the retrieved responses are not always relevant, and all irrelevant responses should be assigned small weights, i.e., ignored in the generation process. We expect our model to be able to determine whether a retrieved reply is useful or not. Based on these considerations, we remove the softmax normalization in Eq. (16) and modify the calculation of the weight \(o'_{k,t}\) as follows:

$$\begin{aligned} o'_{k,t} = \text {sigmoid}(\mathbf {W}_{\beta }[\mathbf {s}_{t-1}; \mathbf {r}_{k,t}]). \end{aligned}$$
(19)

The value of this weight lies between 0 and 1, and a higher value of \(o'_{k,t}\) indicates that \(\mathbf {r}_{k,t}\) is more important in the generation process.
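Putting Eqs. (15), (18) and (19) together, the sentence-level step can be sketched as below. The per-result gate values and the rescaling of the BM25 scores to a range comparable with \(o'_{k,t}\) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sentence_attention(R, sc, s_prev, W_beta, z):
    """R: (n_s, h) retrieved-response vectors r_{k,t}; sc: (n_s,) relevance
    scores (assumed rescaled to [0, 1]); z: (n_s,) refer gates in [0, 1];
    W_beta: assumed to map [s_{t-1}; r_{k,t}] to a scalar. Returns a_t^R."""
    # o'_{k,t} = sigmoid(W_beta [s_{t-1}; r_{k,t}])               Eq. (19)
    o = np.array([sigmoid(W_beta @ np.concatenate([s_prev, r_k]))
                  for r_k in R])
    beta = z * sc + (1.0 - z) * o                    # refer gate, Eq. (18)
    return beta @ R                                  # weighted sum, Eq. (15)
```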

With the above gated hierarchical attention mechanism, we can selectively use the retrieved replies and the words contained in them. The vector \(\mathbf {a}_t^R\) is used as our context vector.

3.4 Keyword-aware decoder

From the two encoders above, the input message and the retrieval information are represented as vectors \(\mathbf {a}_t^M\) and \(\mathbf {a}_t^R\) respectively. The message vector \(\mathbf {a}_t^M\) and the retrieval information context vector \(\mathbf {a}_t^R\) are then concatenated into a joint attention vector and sent to the keyword-aware decoder:

$$\begin{aligned} \mathbf {a}_t = [\mathbf {a}_t^M; \mathbf {a}_t^R], \end{aligned}$$
(20)

where [;] is the concatenation operation.

The modules proposed above exploit retrieved information in the encoding step. We also consider using retrieved results to directly guide the generation process in the decoder. Specifically, as shown in Fig. 8, we modify the generation probability in the decoder to bias it towards certain keywords in related responses; we call this the keyword-aware decoder. The intuition is that keywords appearing frequently in related responses are more relevant and may carry helpful information. To implement this idea, we first extract nouns from the retrieved responses as candidate keywords according to their TF-IDF values, then sort them by frequency and keep the top \(N_k\) as the selected keywords.
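The selection step just described might look like the following sketch, in which the noun test and the TF-IDF table are stand-ins for a POS tagger and corpus statistics, and the threshold is an illustrative parameter of our own.

```python
from collections import Counter

def select_keywords(retrieved_responses, tfidf, is_noun, n_k=15,
                    tfidf_min=0.1):
    """retrieved_responses: list of token lists; tfidf: word -> TF-IDF value;
    is_noun: predicate from a POS tagger. Keep up to n_k frequent nouns."""
    candidates = [w for r in retrieved_responses for w in r
                  if is_noun(w) and tfidf.get(w, 0.0) >= tfidf_min]
    freq = Counter(candidates)          # sort candidates by frequency
    return [w for w, _ in freq.most_common(n_k)]
```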

Formally, at decoding time step t, for a target word \(y_t\), the generation probability \(p_t\) is:

$$\begin{aligned} p_t&= p_n + p_k, \end{aligned}$$
(21)
$$\begin{aligned} p_n&= \text {softmax}(\mathbf {W}_s\mathbf {s}_t + \mathbf {b}_s), \end{aligned}$$
(22)
$$\begin{aligned} p_k&= \text {softmax}(\mathbf {W}_k[\mathbf {s}_{t};\mathbf {a}_t^R]+\mathbf {b}_k), \end{aligned}$$
(23)
$$\begin{aligned} \mathbf {s}_t&= \text {GRU}(\mathbf {y}_{t-1}, [\mathbf {s}_{t-1}; \mathbf {a}_{t}]), \end{aligned}$$
(24)

where \(\mathbf {W}_s\), \(\mathbf {W}_k\), \(\mathbf {b}_s\) and \(\mathbf {b}_k\) are parameters. It is worth noting that the probability \(p_k\) is computed only for the selected keywords; the probabilities of all other words in this vector are masked to zero. In this way, the generation probability is biased towards the selected keywords. For a non-keyword, the generation probability is the same as in the standard Seq2seq model, but for a selected keyword there is an extra probability term that increases its generation probability. This extra term is determined by the current decoder hidden state \(\mathbf {s}_t\) and the retrieval information attention vector \(\mathbf {a}_t^R\). When a keyword is relevant to the already generated words and the input message, it is more likely to appear in the response.
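A sketch of Eqs. (21)-(23) is given below; here the keyword restriction is realized by zeroing the non-keyword entries of \(p_k\) after the softmax, which is one possible reading of the masking described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generation_probs(s_t, a_R, W_s, b_s, W_k, b_k, keyword_mask):
    """keyword_mask: (|V|,) array, 1.0 for selected keywords, 0.0 elsewhere."""
    p_n = softmax(W_s @ s_t + b_s)                            # Eq. (22)
    p_k = softmax(W_k @ np.concatenate([s_t, a_R]) + b_k)     # Eq. (23)
    return p_n + keyword_mask * p_k                           # Eq. (21)
```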

In conclusion, in our keyword-aware decoder, the retrieval information guides the generation process both implicitly, through the joint attention vector, and explicitly, through the keywords.

Fig. 8 Keyword-aware decoder

One advantage of our model is that it is trained to learn how to use different levels of retrieval information through the gated hierarchical attention mechanism. If such information turns out to be unreliable, the gated attention mechanism can assign it a small weight or ignore it. Meanwhile, the extracted keywords can influence the generation process directly, which further helps the model generate more informative replies.

Overall, the retrieved replies and the input message provide complementary information to the response generation module. Our framework offers a new way to integrate retrieval-based and generation-based approaches.

4 Experiments

4.1 Dataset and preprocessing

We use the Chinese Sina Weibo dataset released by NTCIR-13 STC task (Shang et al. 2017) and the English OpenSubtitles dataset proposed by Li et al. (2016a).

For the Weibo dataset, users’ posts are used as messages and the comments as responses. Following the existing approach (Xing et al. 2017), we randomly select 4.3 million pairs as the training set, 50,000 pairs as the validation set and 5000 pairs as the test set, with no overlap among the three sets. The retrieval module is built on the training set and provides related responses for the training, validation and test sets. To prevent the model from “seeing” the ground-truth response, we remove the original (ground-truth) response from the retrieved results for the training set. The messages in the test set are used as inputs to generate responses, and the corresponding original responses serve as the ground truth for computing evaluation metrics. All text is segmented by Jieba, a Chinese word segmentation tool. We construct two vocabularies, for posts and responses, using the 40,000 most frequent words, covering 97.01% and 95.65% of word occurrences respectively. Words not in the vocabulary are replaced by a special token “\(\langle \text {unk} \rangle\)”.
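For concreteness, the vocabulary construction and \(\langle \text {unk} \rangle\) replacement can be sketched as follows (function and variable names are illustrative).

```python
from collections import Counter

def build_vocab(token_lists, size=40000):
    """Keep the `size` most frequent words across all tokenized texts."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {w for w, _ in counts.most_common(size)}

def apply_vocab(tokens, vocab):
    """Map out-of-vocabulary words to the special <unk> token."""
    return [w if w in vocab else "<unk>" for w in tokens]
```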

The OpenSubtitles dataset contains scripted lines spoken by movie characters. As the dataset does not specify which character speaks each subtitle line, following the same assumption as Li et al. (2016a), each subtitle line is treated as a full speaker turn, and our models are trained to predict the next turn given the current one under the assumption that two consecutive turns belong to the same conversation. We randomly select 5 million pairs as the training set, 50,000 pairs as the validation set and 50,000 pairs as the test set. Other settings are the same as for the Weibo dataset, and the dataset is preprocessed as released by its authors.

4.2 Baseline models and experiment setup

We compare our model with the following baseline and state-of-the-art models:

  • S2SA: the standard Seq2seq model with an attention mechanism. This is the basic model for response generation.

  • NRM-hyb: the best model in Shang et al. (2015), which uses two encoders to represent messages in local and global schemes. In the local information encoder, an attention mechanism aggregates and summarizes the information in the input message, and the attention vector is used as the local representation. In the global information encoder, the hidden state of the last word in the input message is used as the global representation. The two representations are concatenated and fed to the decoder. This model uses more complex encoders to obtain better representations of the input message, which is a straightforward way to improve the informativeness of the generated response.

  • MMI: the best model in Li et al. (2016a), which uses a diversity-promoting objective function to train the Seq2seq model. It first trains a Seq2seq model to generate responses from a given input message, and then trains another Seq2seq model to generate input messages from a given response. The first model generates a list of responses for a given input, and the second reranks the list by the probability of recovering the given input. This model modifies the objective function of the Seq2seq model, in contrast to our use of supplementary information; we select it as a baseline to compare which approach better generates informative responses.

  • TA-Seq2seq: the model proposed by Xing et al. (2017), which uses a topic model to extract topic information and exploits it to boost the Seq2seq model. For each input message, the pre-trained topic model assigns a topic, and the corresponding topic words are fed into the decoder through the attention mechanism. In our experiments, we train the topic model on the training set for a fair comparison.

We use the same settings for training on both datasets. The common settings for all models are introduced first, followed by the specific settings for each model.

(1) Common settings: for all models, including ReBoost and the baselines, the dimension of the hidden states of both the encoder and the decoder is 1000, and the dimension of the word embeddings is 300. All model parameters are initialized from a uniform distribution on [− 0.1, 0.1] and trained with the Adam algorithm (Kingma and Ba 2014) and mini-batches of size 128 on an NVIDIA Tesla K40 GPU. The initial learning rate is 0.001 and decays dynamically during training. We use the validation set for early stopping, and beam search with a beam width of 10 for prediction.

(2) Specific settings: (a) NRM-hyb contains two RNN encoders with the same hidden size (1000) but unshared parameters. (b) MMI trains two Seq2seq models, both using the common settings. (c) The topic model for TA-Seq2seq is trained with the method of Yan et al. (2013), a state-of-the-art topic model for short texts. Following the original experimental setting, the number of topics is 200 and the top 100 words of each topic are selected. For each input message, the 15 topic words with the highest probability (topic probability multiplied by word probability) are selected as supplementary information for decoding. (d) In ReBoost, we use Apache Solr 6.5 with its default BM25 ranking function as the retrieval module, and the number of retrieved results is ten. The 15 words with the highest TF-IDF values in the retrieved results are provided to the decoder with a biased generation probability; zero padding is used if there are fewer than 15 keywords. As the retrieved results come from the training set, we must avoid providing the original response for an input message. Therefore, any retrieved response identical to the original one is removed, which forces the model to learn how to use the retrieved results rather than simply copy the ground truth. All datasets and code are publicly available.

4.3 Evaluation metrics

To evaluate the performance of our model and the baselines, we follow existing studies and employ several standard metrics: perplexity, Distinct and BLEU-N.

Fig. 9 Examples demonstrating the metrics

Distinct-1 and Distinct-2 These two metrics were proposed by Li et al. (2016a) to measure the degree of diversity as the ratios of distinct unigrams and bigrams in the generated responses. Higher values indicate that the replies contain more distinct words and thus potentially more information. Consider the example in Fig. 9: all unigrams and bigrams in the left case are distinct, so both Distinct-1 and Distinct-2 are 1.00. In the right case, there are 5 unigrams and 4 bigrams in the sentence, but only 3 of each are distinct, so the values are 0.60 and 0.75 respectively.
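Distinct-N is straightforward to compute; the snippet below implements the definition directly on a toy input.

```python
def distinct_n(responses, n):
    """responses: list of token lists; returns (# distinct n-grams) / (# n-grams)."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# A toy response with 5 unigrams (2 distinct) and 4 bigrams (2 distinct):
print(distinct_n([["a", "b", "a", "b", "a"]], 1))  # 0.4
print(distinct_n([["a", "b", "a", "b", "a"]], 2))  # 0.5
```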

BLEU-N BLEU is a metric originally used in machine translation (Papineni et al. 2002). It evaluates the output using n-gram matching between the output and the reference; BLEU-1, BLEU-2, BLEU-3 and BLEU-4 are commonly used.

Formally, the BLEU-N score is calculated by:

$$\begin{aligned} \text {BLEU-N} = \exp {\left( \min {\left( 1-\frac{r}{c}, 0\right) } + \sum _{n=1}^{N}{w_n\log {p_n}}\right) }, \end{aligned}$$
(25)

where r and c are the lengths of the reference and candidate responses respectively, \(p_n\) is the modified n-gram precision, N is the maximum n-gram length, and \(w_n=1/N\). From the formula, we can see that the BLEU value depends on both the response length and the n-gram precision. Higher BLEU values mean that the output response shares more words with the reference and is more similar to it. As shown in Fig. 9, the left case is much closer to the ground-truth sentence since they share more words, so its BLEU values are much higher. The trigrams and 4-grams in both cases all differ from the ground truth, so BLEU-3 and BLEU-4 are equal to 0.
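For reference, a sentence-level rendering of Eq. (25) without smoothing is sketched below; a real evaluation would typically use an established corpus-level implementation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(candidate, reference, N=2):
    """Eq. (25) for one sentence pair: clipped precision + brevity penalty.
    Without smoothing, the score is 0 whenever any p_n is 0."""
    precisions = []
    for n in range(1, N + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    bp = min(1.0 - len(reference) / len(candidate), 0.0)  # min(1 - r/c, 0)
    return math.exp(bp + sum(math.log(p) / N for p in precisions))
```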

Table 1 Automatic evaluation results

4.4 Overall performance

We compare our ReBoost model with all baselines; the results are shown in Table 1. The performance improvements of ReBoost on all metrics are statistically significant (p value \(<0.01\)), with Bonferroni correction applied to counteract the multiple comparisons problem. From the results, we observe the following.

On the Weibo dataset, ReBoost achieves the best performance on all metrics. From the Distinct-1 and Distinct-2 results, we conclude that ReBoost generates more distinct words, which partially indicates that its responses are more diverse and informative. This supports our assumption that retrieved results are useful supplementary information for response generation. As for the BLEU scores, a higher BLEU score usually indicates higher similarity between the generated responses and the ground truth; all the BLEU values demonstrate that ReBoost outperforms the other baselines in response generation.

On the OpenSubtitles dataset, the conclusions are similar except on two points: (1) All values are lower than on the Weibo dataset. Comparing the two datasets, we find that the sentences in OpenSubtitles are often incomplete, possibly because of ellipsis in English; incomplete sentences make the mapping much harder to learn. (2) MMI achieves the best BLEU-1 among all models. Inspecting the generated responses, we find many long and repetitive sentences such as “i don’t know what you’re thinking”. Such responses achieve better BLEU values but are boring and trivial, which leads to lower Distinct values.

In summary, our ReBoost model outperforms the other baseline models on almost all automatic evaluation metrics. These results show that incorporating retrieved responses improves the performance of the Seq2seq model.

Table 2 Human evaluation results on Weibo dataset

4.5 Human evaluation

4.5.1 Results and analysis

In addition to evaluating the models with automatic metrics, we also conduct a human evaluation. We randomly select 200 messages from the test set and collect the corresponding results generated by each model. We then invite five evaluators with rich experience of Sina Weibo to perform two kinds of evaluation: absolute scoring and side-by-side comparison. In both evaluations, Fleiss’ kappa (Fleiss and Cohen 1973) is used to measure the degree of agreement.

The first human evaluation is absolute scoring. Following the criteria of Shang et al. (2015), the labelers are asked to judge each result on five criteria: grammar correctness, fluency, logic consistency, semantic relevance and scenario dependence. Responses from the different models are shuffled and mixed together, and the evaluators independently assign each response a score from 0 to + 2. A suitable (+ 2) response is appropriate, natural and informative. A neutral (+ 1) response is either suitable only in a specific scenario, or so trivial and universal that it could answer many messages. An unsuitable (0) response is one for which no suitable scenario can be found, i.e., it is irrelevant to the input message or contains grammar errors. To ensure consistency, the annotators are trained with examples before labeling.

Table 2a shows the results. The kappa scores indicate fair agreement among labelers on the quality of the responses. The results clearly demonstrate that our ReBoost model generates many more informative responses (+ 2) and fewer trivial responses (+ 1), indicating that the additional retrieval information helps generate more informative replies. However, compared with TA-Seq2seq, ReBoost generates more results labeled 0. Analyzing the results generated by ReBoost, we find that it tends to use more diverse words to synthesize informative responses, which may introduce noise and hurt the coherence of the response. In the future, we plan to add more constraints to the decoder for generating more coherent responses. Among the baseline models, TA-Seq2seq, which introduces topic information as prior knowledge, generates the most informative responses (28%). Both ReBoost and TA-Seq2seq incorporate additional information into the generation process, so the results consistently confirm that incorporating more information helps alleviate the trivial-reply problem.

We further conduct a side-by-side comparison of the generated results. For the 200 samples, we create 800 triplets (message, response 1, response 2), where one response is generated by ReBoost and the other by a baseline. In each triplet, the two responses are randomly shuffled so that the evaluators cannot easily guess which one was generated by ReBoost. The evaluators follow the same five criteria as in the previous annotation to judge the quality of each response. They compare the two results and decide among win, lose and tie (win: response 1 is better; lose: response 2 is better; tie: they are equally good or bad).

The side-by-side annotation results are shown in Table 2b. We find: (1) The ReBoost model outperforms all the baselines, indicating that it generates much more suitable results. (2) ReBoost outperforms TA-Seq2seq, confirming that using retrieved replies is more effective than selecting a set of topic words to enhance response generation.

Fig. 10 Cases of disagreement among annotators

4.5.2 Discussions

The kappa values in the human evaluation are not high. To investigate the reason, we sample some cases that caused disagreement among annotators; they are shown in Fig. 10, with the generated responses underlined.

In the first case, two annotators think the generated response has grammatical errors and find it difficult to understand. On the contrary, the other three annotators consider the response suitable, since it mentions the key information “cut hair” from the input message. The second and third examples are similar: one annotator cannot understand the response and assigns it a score of 0, some of the remaining annotators think the response is trivial and could answer many input messages, while others consider it proper.

These examples show that it is difficult to define a gold standard for evaluating response generation. In the future, we plan to evaluate from different angles, such as informativeness and appropriateness, and perform the annotation for each separately. This may improve the agreement among annotators.

Table 3 Module ablation results

4.6 Module ablation

In our model, we design a new gated hierarchical attention mechanism to encode the retrieved results, and we modify the decoder to bias the generated responses towards keywords from the retrieved results. To investigate the effectiveness of these two strategies and the performance of the retrieval module, we conduct a module ablation experiment.

First, we remove the gate mechanism in the retrieval information encoder; in other words, the relevance scores returned by the retrieval module are not provided to the model, and the weights of the retrieved results are learned during training without any prior knowledge. We denote this model ReBoost-gate. Second, we remove the additional probabilities for the keywords in the decoder; all words are treated as normal words and their generation probabilities are computed by Eq. (22). This model is denoted ReBoost-keywords. Third, to investigate the contribution of the retrieval information encoder, we remove it, leaving only the input message encoder and the keyword-aware decoder; this model is denoted ReBoost-retrieval. Finally, since the retrieval module provides many related responses, we can use the top result directly as the reply; this model is denoted Retrieval.

The results are reported in Table 3. From them, we find: (1) Except for Retrieval on the Distinct metrics, the full ReBoost model achieves the best results on all metrics, demonstrating that all modules in ReBoost are useful in boosting the Seq2seq model. (2) The retrieval information encoder is the most important module in ReBoost, since performance drops most when it is removed. (3) The contribution of the gate mechanism and the keyword-aware decoder is less clear-cut, as the results differ between the two datasets. We suspect the performance is data-dependent: if more accurate keywords can be extracted, the keyword-aware decoder should contribute more. (4) Retrieval achieves extremely good Distinct results but fails on the BLEU values. Inspecting the corresponding results, we find that they are fluent and informative but not very relevant to the input message. This is because these responses are human-written and thus much longer and more natural. It also indicates that directly using retrieved results as replies is unreliable, and they are better used as supplementary information.

Fig. 11 Case study samples

Fig. 12 Bad responses with different types of errors

4.7 Case study and error analysis

Figure 11 shows examples generated by ReBoost, TA-Seq2seq and S2SA. The underlined sentences are one of the retrieved results. From the figure, we make the following observations:

  1. In the first example, compared with S2SA, both ReBoost and TA-Seq2seq generate more suitable results; the S2SA model even generates a confusing response. This is consistent with the basic assumption that the Seq2seq model can be improved by incorporating more supplementary information.

  2. In the first two examples, the words in bold show that ReBoost can generate responses containing keywords that appear in the messages and retrieved results, making the responses more relevant and informative. This supports our assumption that retrieved results can boost the Seq2seq model to generate much more informative replies.

  3. In the last example, ReBoost better captures the semantic relationship between a message and a response (such as “zodiac signs” and “Aries”). This is achieved by providing the retrieved results to the model, since “Aries” appears in the retrieved results.

To further investigate how to improve our model, we also perform an error analysis. We collect the samples that received more than three 0 labels, obtaining 42 samples (out of 200 in total). After checking their corresponding retrieved results, we categorize the errors into the three types shown in Fig. 12.

The first type of error is caused by irrelevant retrieved results; about 16.7% (7 of 42) of the bad responses contain this error. As shown in the first example, the retrieved result contains the name “Suwei” and ReBoost inserts this word into the generated response. In this case, the generated response is suitable only in some specific scenarios (e.g., when the input message is from Suwei or Dapeng). This indicates that our model cannot judge how specific a word is, and overly specific words may hurt the generated response. In the future, we could use keyword extraction techniques to assign a weight to each word in each retrieved result, which may help the model reduce this type of error.

The second type of error stems from neglecting useful retrieved results; about 33.3% (14 of 42) of the errors are of this type. As shown in the second example, the retrieval module provides a suitable response for the input message but ReBoost neglects it. In the future, we plan to collect all responses generated by ReBoost and retrieved by the retrieval module, and rerank them to output the most suitable one as the reply.

The third type of error is caused by using the retrieved results incorrectly; 50% (21 of 42) of the bad responses are of this type. In the third example, the top retrieved result mentions the word “most”, but the generated response uses this word repeatedly and makes a mistake. This indicates that we need to refine our keyword-aware decoder to ensure that inserted keywords do not hurt the sentence.

5 Conclusion

In this paper, we propose using retrieved replies to boost the Seq2seq model through a gated hierarchical attention mechanism, in order to generate more informative and interesting responses. This is a novel way to combine retrieval-based and generation-based methods. Empirical results with both automatic and human evaluations confirm that our model generates better responses than the state-of-the-art models. The proposed framework can be improved in several directions in the future: building a more advanced retrieval module, extracting other types of information from retrieved replies, etc.