The VAE [7,15] is a directed graphical model with continuous latent variables, and it is widely used in image and natural language generation tasks. Unlike a traditional autoencoder, the VAE encodes an input $x$ into a probability distribution and then reconstructs the original input with a decoder network by sampling a continuous latent variable $z$ from this distribution, as illustrated in Figure 2. A formal description of the problem is as follows. Let $x$ be an observation of a random variable taking values in $\mathcal{X}$. We assume that the generation of $x$ involves a continuous latent variable $z$, taking values in $\mathcal{Z}$, by means of a joint density $p_{\theta}(x, z) = p_{\theta}(x \mid z)\, p(z)$, parametrized by $\theta$. Given a set of observed data points $\{ x^{(i)} \}_{i=1}^{N}$, the goal of maximum likelihood estimation is to estimate the parameters $\theta$ that maximize the marginal log-likelihood $\log p_{\theta}(x)$:
$$\log p_{\theta}(x) = \log \int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, \mathrm{d}z .$$
Due to the integration over the latent variables, it is intractable to directly compute or differentiate the marginal log-likelihood. A common approach is to maximize a variational lower bound on the marginal log-likelihood by introducing an approximate posterior $q_{\phi}(z \mid x)$:
$$\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right] - \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right),$$
where KL denotes the Kullback–Leibler divergence. The evidence lower bound can also be rewritten as a minimum description length loss function:
$$\mathcal{L}(\theta, \phi; x) = \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right) - \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right],$$
where the neural network with parameters φ, called the “recognition” model, is introduced to approximate the true posterior $p_{\theta}(z \mid x)$. Another neural network with parameters θ, represented as $p_{\theta}(x \mid z)$, aims to reconstruct the data. In general, we assume that $q_{\phi}(z \mid x)$ is a multivariate diagonal Gaussian distribution:
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\, \mu_{\phi}(x),\, \mathrm{diag}\!\left(\sigma_{\phi}^{2}(x)\right)\right),$$
where $\mu_{\phi}(x)$ and $\sigma_{\phi}^{2}(x)$ are outputs of the recognition network.
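With the standard normal prior $p(z) = \mathcal{N}(0, I)$ commonly adopted in the VAE (an assumption here, since the prior is not specified above), the KL term in the loss above has a well-known closed form for this diagonal Gaussian posterior, stated for completeness; $\mu_{j}$ and $\sigma_{j}^{2}$ denote the $j$-th components of $\mu_{\phi}(x)$ and $\sigma_{\phi}^{2}(x)$, and $d_z$ is the dimensionality of $z$:
$$\mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d_z} \left( \mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1 \right).$$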
For particularly simple parametric forms of $q_{\phi}(z \mid x)$, one can backpropagate through the sampling process $z \sim q_{\phi}(z \mid x)$ by applying the reparametrization trick, which first samples $\epsilon \sim \mathcal{N}(0, I)$ and then computes $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. As a result, the VAE can be trained efficiently using stochastic gradient descent, which is essential for VAE training.
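To make the reparametrized objective concrete, a minimal sketch in PyTorch is given below; the feed-forward encoder and decoder, the layer sizes, and the Bernoulli reconstruction likelihood are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: a recognition network q_phi(z|x) parametrized as a
    diagonal Gaussian, and a decoder p_theta(x|z); sizes are illustrative."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log sigma^2 of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: sample eps ~ N(0, I), then z = mu + sigma * eps,
        # so gradients can flow through the sampling step.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_logits = self.dec(z)
        # Reconstruction term -E_q[log p_theta(x|z)] (Bernoulli likelihood assumed)
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return rec + kl   # negative ELBO, minimized by stochastic gradient descent
```

A training loop would simply minimize the returned loss with a stochastic optimizer; when the data are sentences, the encoder and decoder are typically recurrent networks rather than the feed-forward layers assumed here.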
The CVAE is a modification of the VAE that conditions generation on certain attributes, e.g., generating different human faces given skin color, gender, age, and so on [42], or generating different sentences given sentiment, topic, and so on. The formula is as follows, where $c$ denotes the condition:
$$\mathcal{L}(\theta, \phi; x, c) = \mathrm{KL}\!\left(q_{\phi}(z \mid x, c) \,\|\, p_{\theta}(z \mid c)\right) - \mathbb{E}_{q_{\phi}(z \mid x, c)}\!\left[\log p_{\theta}(x \mid z, c)\right],$$
where the prior over the latent variable, $p_{\theta}(z \mid c)$, is now conditioned on $c$ as well.
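A minimal sketch of one way to implement this conditioning is given below, again in PyTorch; concatenating a condition vector $c$ to the inputs of the recognition network, the prior network, and the decoder is an illustrative assumption (any conditioning scheme with the same factorization would do), and the feed-forward layers and sizes are likewise placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal CVAE sketch: q_phi(z|x,c), a learned prior p_theta(z|c), and a
    decoder p_theta(x|z,c); conditioning by concatenation is an assumption."""
    def __init__(self, x_dim=784, c_dim=10, z_dim=32, h_dim=256):
        super().__init__()
        self.post = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.Tanh(),
                                  nn.Linear(h_dim, 2 * z_dim))   # mu, logvar of q(z|x,c)
        self.prior = nn.Sequential(nn.Linear(c_dim, h_dim), nn.Tanh(),
                                   nn.Linear(h_dim, 2 * z_dim))  # mu, logvar of p(z|c)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        mu_q, logvar_q = self.post(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(c).chunk(2, dim=-1)
        # Reparametrized sample from the recognition network q_phi(z|x,c)
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        x_logits = self.dec(torch.cat([z, c], dim=-1))
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # KL(q_phi(z|x,c) || p_theta(z|c)) between two diagonal Gaussians
        kl = 0.5 * torch.sum(logvar_p - logvar_q
                             + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                             - 1.0)
        return rec + kl   # negative conditional ELBO
```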
Variational encoder-decoders have shown promising results in text generation [4,34,43]. Straightforwardly optimizing Equation (4), however, results in the KL-vanishing problem, where the Recurrent Neural Network (RNN) part ends up explaining all of the structure without making use of the latent representation. Much meaningful work has been done to alleviate this problem [11,14,45]; one widely used remedy, KL cost annealing, is sketched below.
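KL cost annealing gradually increases the weight on the KL term from 0 to 1 during training so that the decoder cannot simply ignore the latent variable early on. The linear schedule and the `warmup_steps` value below are illustrative choices, not necessarily those used in the cited works:

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: the weight rises from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

# Inside a training loop, the annealed objective would then be, e.g.:
#     loss = reconstruction_loss + kl_weight(step) * kl_loss
```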
When dealing with text generation, the CVAE model can generate more diverse sentences than the Seq2Seq model. However, in the emotional dialogue generation task, the general CVAE model is not powerful enough to stay consistent with the corresponding emotion. Our proposed method employs the CVAE as the baseline to accommodate fine-grained emotions.