
An Emotional Dialogue System Using Conditional Generative Adversarial Networks with a Sequence-to-Sequence Transformer Encoder

Published: 23 November 2024

Abstract

Understanding the expression of emotion and generating appropriate responses are key steps toward constructing emotional conversational agents. In this article, we propose a framework for single-turn emotional conversation generation. Our model has three main components: a sequence-to-sequence model with stacked encoders, a conditional variational autoencoder, and conditional generative adversarial networks. For the sequence-to-sequence model with stacked encoders, we designed a two-layer encoder that combines the Transformer with gated recurrent unit-based neural networks. Because of the flexibility of the sequence-to-sequence model, we adopted a conditional variational autoencoder in our framework, which uses latent variables to learn a distribution over potential responses and thus generates diverse responses. Furthermore, we regard the conditional variational autoencoder-based sequence-to-sequence model as the generative model, and its training is assisted by both a content discriminator and an emotion classifier, which help the model improve content relevance and emotion expression. We use automated evaluation and human evaluation to compare our model with the baselines on the NII Test Collections for IR Systems short-text conversation task Chinese emotional conversation generation subtask dataset [44], and the experimental results demonstrate that our proposed framework can generate semantically reasonable and emotionally appropriate responses.

1 Introduction

In the past several years, there has been a tremendous wave of interest in neural conversational models. With the advancement of end-to-end neural dialogue generation techniques and natural language processing, conversational agents have become widely used in our daily lives. This growing usage has raised considerable expectations in the research field. For this reason, open-domain dialogue systems based on end-to-end neural dialogue generation, which aim to generate human-like responses to users by learning correlations among the words in conversations, have become an active object of study.
Sequence-to-sequence (Seq2Seq) is a significant model in neural dialogue generation. Owing to its simplicity and the flexibility it offers for building systems, a considerable amount of open-domain conversation generation research (Vinyals et al. [38]; Li et al. [21]) has advanced by using Seq2Seq models trained on large-scale data. The majority of research on dialogue systems has aimed to improve the quality and the word diversity of generated sentences by employing Seq2Seq models in different ways. These open-domain conversation generation systems do not encode the conversation structure and are entirely data-driven: they are trained with maximum-likelihood estimation (MLE) on a large training corpus. However, despite the success of neural conversational models in generating grammatically correct and human-like answers, the generated responses are often dull and generic rather than involving the sharing of feelings and emotions.
Notably, a number of studies have argued that emotion is an essential part of human communication and that displaying emotion in human-computer interfaces is key to enhancing users' performance and increasing users' satisfaction. For example, Picard [29] provided an extensive discussion of the necessity of affect analysis not only in human communication and interaction but also in human creativity and decision-making. Lüdtke et al. [18] showed that even when sentences are otherwise the same, different emotional words lead to entirely different feelings in the reader. Experiments in Reference [30] showed that an empathetic computer agent can contribute to a more positive perception of interaction. The existing literature [31] indicates that naturally addressing affective states in dialogue systems can improve user satisfaction. Given the critical importance of emotions, being empathetic is an essential step toward human-like conversation. Furthermore, there have also been a few attempts (Zhou et al. [44]; Huang et al. [13]) to incorporate emotion into end-to-end neural dialogue generation, but each approach has its own shortcomings. For example, the model that Zhou et al. [44] introduced focuses on supervised context-response pairs with specified emotion labels and requires a large annotated training corpus for effective training. Huang et al. [13] presented a model that emphasizes emotions but neglects the semantic meaning and the diversity of word choices in the generated sentences.
The purpose of this research is to express desired emotions in neural dialogue generation and to enable machines to better understand human emotions. However, addressing emotional factors in a neural dialogue system presents several challenges. First, there is a lack of Chinese dialogue datasets labeled with emotions. This problem was addressed by Zhou et al. [44], who trained a neural emotion classifier to predict the emotion expressed in the responses of post-response pairs and used the predicted labels to annotate the data. Second, it is challenging to generate emotional sentences naturally and with grammatical correctness. To address these challenges, we utilized a Transformer-based encoder [37] and a gated recurrent unit (GRU)-based encoder in the Seq2Seq model, and adopted the conditional variational autoencoder and the attention mechanism in our sequence generation model so that the dialogue system can express desired emotions naturally and explicitly in a conversation. Furthermore, we derived the idea from Kong et al. [17], who proposed a conditional generative adversarial net (GAN)-based sentiment-controlled dialogue generation model, and modified the discriminator network. Our framework has three subcomponents: the generator, the content discriminator, and the emotion classifier. The generator is responsible for generating emotional responses given the conversation history and emotion labels; the adversarial content discriminator attempts to determine whether the generated response comes from the real data distribution, which enhances the quality of the emotional responses; and the emotion classifier distinguishes whether the generated response matches the given emotion category. In summary, the main contributions of this article are the following:
(1)
We enhance the sequence-to-sequence model by designing a sophisticated two-layer encoder that integrates the Transformer architecture with gated recurrent units. This hybrid approach leverages the Transformer’s ability to capture long-range dependencies and contextual information while benefiting from the GRUs’ efficiency in handling sequential data. This combination enhances the model’s capability to encode intricate input sequences more effectively, leading to improved performance in generating coherent and contextually relevant responses.
(2)
Recognizing the flexibility of the sequence-to-sequence framework, we incorporate a conditional variational autoencoder (CVAE) into our model. The CVAE uses latent variables to learn a distribution over potential responses, allowing the model to generate a wide range of diverse and contextually appropriate responses. This approach addresses the common issue of generating repetitive or generic responses, thereby enhancing the richness and variability of the generated dialogue.
(3)
Our model utilizes a conditional variational autoencoder-based sequence-to-sequence architecture as its generative core. To further refine the generative process, we introduce a content discriminator and an emotion classifier to assist during training. The content discriminator ensures that the generated responses maintain high relevance and coherence with the input content, while the emotion classifier promotes the accurate expression of specified emotions in the responses. This dual-assistance framework significantly boosts the model’s ability to produce emotionally nuanced and content-rich dialogue, resulting in a more engaging and human-like conversational experience.
(4)
We demonstrate that our model outperforms existing methods through comprehensive experiments on real-world datasets, and validate that the proposed method can generate semantically reasonable and emotionally appropriate responses.
The results of this study could be useful for practitioners building conversational systems that generate emotional responses. The rest of this article is organized as follows. Section 1 provides a brief introduction to dialogue systems and sentiment analysis. Sections 2 and 3 review the related work and the background information relevant to our proposed method, respectively. Section 4 presents the design of our model. Section 5 describes our experimental setup and implementation details, and Section 6 presents the experimental results and analysis. Finally, Section 7 summarizes the contributions of this research and concludes our work.

2 Related Work

The Seq2Seq framework described in Sutskever et al. [36] learns to read an input sequence and then generate an output sequence. Vinyals et al. [38] successfully presented a simple language model based on this Seq2Seq framework that is capable of extracting knowledge from a noisy but open-domain dataset. Since the introduction of neural dialogue models, many works have built on this Seq2Seq framework, enhancing the quality of responses in different respects.
To make the generated responses higher quality, more diverse in wording, and indistinguishable from human-generated dialogue utterances, Li et al. [23] introduced a reinforcement learning framework that encourages the generative method to sustain more extended discussions. Li et al. [24] proposed an adversarial learning framework to generate dialogues that closely resemble human dialogues. Vaswani et al. [37] proposed a practical encoder-decoder architecture called the Transformer, which relies entirely on attention and fully connected networks, without any convolutional or recurrent neural networks. To handle unknown words, See et al. [33] and Gu et al. [11] combined the pointer network framework with the Seq2Seq framework to copy sub-sequences from the input sequence and place them at appropriate positions in the output sequence. To provide a conversational agent with personality, Li et al. [22] presented a persona-based model that captures speaker consistency in neural response generation. Herzig et al. [12] proposed a neural response generation model for customer service based on encoding target personality traits. Although these works have paid a great deal of attention to improving the quality of generated responses, such as grammatical correctness and human-like answers, the responses lack emotion. For instance, we cannot tell the model to express a specific emotion in a response.
Emotions are essential for communication between people. Therefore, for a machine to communicate smoothly with humans, it needs to generate sentences with emotion. Picard [29] provides a detailed discussion of the necessity of affect analysis not only in human communication and interaction but also in human creativity and decision-making. Furthermore, emotional factors play an important role in human-machine interaction. The capability of detecting signs of human emotions and reacting to users in a satisfactory way can improve the quality and naturalness of communication. Reeves and Nass's [32] research indicates that people relate to computers in the same way as they relate to other humans, and some of these relationships are identical to real social relations. The existing literature [31] presented by Prendinger, Mori, and Ishizuka indicates that addressing the affective state in dialogue systems in a natural way can improve user satisfaction. Experiments in Prendinger and Ishizuka [30] showed that an empathetic computer agent helps users have lower stress levels (measured via skin conductance) shortly after empathic and apologetic feedback.
To address the emotion factor in a dialogue conversation system, Zhou et al. [44] first introduced the emotion factor into large-scale conversation generation. They dealt with the problem by using external and internal memory networks, in which the external memory network decides whether to choose an emotional or a generic word, and the internal memory network measures how much an emotion has already been expressed. Huang et al. [13] presented three models that can either concatenate the embedded emotion before or after the input message is passed to the encoder, or inject the embedded emotion into the decoder. Kong et al. [17] proposed a conditional GAN-based sentiment-controlled dialogue generation model. The generator produces an emotional response based on the specific emotional label, and the discriminator distinguishes the generated responses from the correct responses by checking the dialogue history and the emotional label. Later, Zhou et al. [43] introduced the nCG-ESM system to generate diverse emotional responses by enhancing the encoder-decoder framework with an emotional supervised mechanism and an emotional classifier for emotion distribution. A redundancy penalty was added to the objective function to prevent repetitive words, resulting in better quality responses. The results showed that the model effectively infused responses with different emotions. Peng et al. [28] proposed TE-ECG, a model with topic and dynamic emotional attention modules. The topic module ensures response relevance by identifying and using topic words as prior knowledge. The dynamic emotional attention combines content with emotional factors during decoding to produce diverse, informative, and emotion-specific utterances. In Reference [8], the proposed latent emotion memory (LEM) network addresses multi-label emotion classification through a latent emotion module and a memory module. The latent emotion module learns emotion distributions using a variational autoencoder, while the memory module captures emotion-related features. These representations are combined and fed into a bi-directional gated recurrent unit for prediction. The model was trained in a supervised end-to-end manner, allowing it to learn latent emotion distributions without external knowledge. In Reference [4], the proposed topic-enhanced capsule (TECap) network for multi-label emotion classification consists of a topic module and a capsule module. The topic module uses a variational autoencoder (VAE) to learn latent topics from bag-of-words input. The capsule module captures emotion features through three deep capsule layers, using a latent topic attention-based routing algorithm to transfer semantic features. It then computes the probability for each emotion label independently. These components are trained jointly, enabling TECap to learn latent topic information without external knowledge. Fei et al. [6] proposed a multiplex cascade framework for unified aspect-based sentiment analysis (ABSA) using a cascade grid-tagging scheme within a multi-task architecture. This system enhances subtask interactions by reusing knowledge from lower-level tasks and incorporates external syntax knowledge, such as part-of-speech tags and dependency features, through a unified syntax graph convolution network. Evaluations on ABSA datasets show significant improvements over state-of-the-art models across seven subtasks. The model leverages existing training data without needing additional annotations, demonstrating the effectiveness of multiplex decoding and syntax integration for ABSA.
Li et al. [20] enhanced task performance by introducing a dual-level disentanglement mechanism that separates features into modality and utterance spaces using contrastive learning. For feature fusion, a contribution-aware fusion mechanism (CFM) to manage multimodal contributions and a context refusion mechanism (CRM) to coordinate dialogue contexts were proposed. The system achieves state-of-the-art performance on two public MM-ERC datasets, effectively utilizing multimodal and context features. These methods also have potential applications in other conversational multimodal tasks.
Fei et al. [7] presented an innovative encoder-decoder framework for end-to-end aspect-based sentiment triplet extraction (ASTE). Specifically, the ASTE task is first modeled as an unordered triplet set prediction problem, which is handled with a non-autoregressive decoding paradigm with a pointer network. A novel high-order aggregation mechanism was proposed to fully integrate the underlying interactions between the overlapping structures of aspect and opinion terms. Next, a bipartite matching loss was introduced to facilitate the training of the non-autoregressive system. Experimental results on benchmark datasets show that the proposed framework significantly outperforms state-of-the-art methods. In Reference [5], a multi-hop reasoning solution for implicit sentiment detection was introduced, significantly improving over traditional non-reasoning methods. It is the first successful application of the Chain of Thought (CoT) idea to sentiment analysis. The method is simple, effective, and easily applicable to other similar NLP problems. Li et al. [19] introduced the task of conversational aspect-based sentiment quadruple analysis (DiaASQ) for detecting target-aspect-opinion-sentiment quadruples in dialogues. A large-scale, high-quality DiaASQ dataset in Chinese and English was constructed, and a neural model was developed to benchmark the task, incorporating dialogue-specific and discourse features for improved quadruple extraction. The contributions include pioneering dialogue-level aspect-based sentiment analysis, releasing a bilingual dataset, and presenting an effective end-to-end model for the task. In Reference [41], Zhang et al. proposed an approach that adopts prompt-based conversations with a large language model (LLM) as a knowledge base to enhance dialogue systems by providing step-by-step analyses similar to a human counselor. This expert advice is incorporated during both training and inference, improving the model's ability to integrate useful strategies. Evaluations show that this approach makes models more suggestive, helpful, and engaging than those not using expert consultation, highlighting the benefit of combining models. Xue et al. [40] introduced the emotional chat model (E-chat), designed to respond based on detected speech emotions. To address the lack of emotional spoken dialogue datasets, the E-chat200 dataset was developed; emotion embeddings are extracted with a speech encoder and combined with the LLM decoder inputs for joint training, so the model can respond according to various emotions. Evaluations show that E-chat significantly outperforms baseline LLMs on objective metrics and achieves higher mean opinion scores (MOS) in subjective evaluations, demonstrating its ability to deliver emotionally nuanced responses.
In this article, we derive the idea from Kong et al. [17] by modifying the model in the discriminator network and using a Transformer-based encoder and emotion classifier to enhance the dialogue response generation.

3 Background

In this work, our goal is to train a dialogue system that is able to generate appropriate responses given a post and an emotional label. In this section, we first introduce the GRU-based sequence-to-sequence model and the Transformer-based sequence-to-sequence model. Next, we briefly describe the framework of conditional variational autoencoders. Finally, we discuss the existing method of generative adversarial nets.

3.1 GRU-based Sequence-to-Sequence Model

In this section, we mainly introduce a Seq2Seq model with GRUs [3]. A sequence-to-sequence model (i.e., an encoder-decoder network, Seq2Seq) consists of two recurrent neural networks called the encoder and the decoder [36]. The encoder reads the dialogue history \(W^{h}=(w^{h}_{1}, w^{h}_{2}, \ldots , w^{h}_{n})\), converts it into hidden representations \(h=(h_{1}, h_{2}, \ldots , h_{n})\), and outputs a single vector called the context vector c, which is obtained by taking the last hidden state of h according to Equations (1) and (2):
\begin{equation} h_{t}=GRU_{enc}(h_{t-1}, w^{h}_{t}), t=1, \ldots , n, \end{equation}
(1)
\begin{equation} c=h_{n}. \end{equation}
(2)
When the decoder reads the context vector c, it produces an output sequence \(W^{r}=(w^{r}_{1},w^{r}_{2},\) \(\ldots , w^{r}_{m})\). More precisely, at every decoding time step, the decoder is given its previous hidden state and the embedding of the previous target word \(e(w^{r}_{t-1})\) to update its hidden state \(s_{t}\), which is calculated by Equation (3):
\begin{equation} s_{t}=GRU_{dec}(s_{t-1}, c, e(w^{r}_{t-1})). \end{equation}
(3)
After obtaining the hidden state \(s_{t}\), the decoder predicts, at the tth time step, a probability distribution over the whole vocabulary through a softmax function. The formulation is given by Equation (4), where \(V_{o}\) is a trainable matrix:
\begin{equation} P(w^{r}_{t}|w^{r}_{t-1}, w^{h}) = Softmax(V_{o}s_{t}). \end{equation}
(4)
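For illustration, the following is a minimal PyTorch sketch of the GRU-based encoder and decoder in Equations (1) to (4). It is not the implementation used in this work; the module and dimension names (vocab_size, emb_size, hidden_size) are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, history_ids):
        # history_ids: (batch, n) token indices of the dialogue history W^h
        emb = self.embedding(history_ids)
        outputs, h_n = self.gru(emb)     # Eq. (1): h_t = GRU_enc(h_{t-1}, w^h_t)
        context = h_n[-1]                # Eq. (2): c = h_n, the last hidden state
        return outputs, context

class GRUDecoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # the previous word embedding is concatenated with the context vector c
        self.gru = nn.GRU(emb_size + hidden_size, hidden_size, batch_first=True)
        self.V_o = nn.Linear(hidden_size, vocab_size)   # trainable matrix V_o

    def forward(self, prev_word, prev_state, context):
        # prev_word: (batch,) previous target token w^r_{t-1}
        e = self.embedding(prev_word).unsqueeze(1)
        gru_in = torch.cat([e, context.unsqueeze(1)], dim=-1)
        _, s_t = self.gru(gru_in, prev_state)              # Eq. (3)
        probs = torch.softmax(self.V_o(s_t[-1]), dim=-1)   # Eq. (4)
        return probs, s_t
```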

3.2 Attention Mechanism

The attention mechanism [1, 26] is essential for a sequence model to focus on a specific range of the input sentence. It should be noted that we mainly introduce global attention with the general score function described in Reference [26]. When only the context vector c is passed between the encoder and the decoder, the basic encoder must compress the entire input sentence into a single fixed-length vector, which becomes an information bottleneck. To handle this problem, the attention mechanism utilizes a dynamically changing context \(\widetilde{h}_{t}\) in the decoding process. Mathematically, the attention mechanism calculates a set of attention weights \(a_{t}\) over all hidden states of the encoder \(h=(h_{1}, h_{2}, \ldots , h_{n})\) and the decoder hidden state \(s_{t}\) at time t, as formulated in Equation (5). In this way, at every decoding time step, the attention mechanism allows the decoder to focus on different parts of the encoder outputs:
\begin{equation} a_{t} = Softmax(s_{t}W_{a}h), \end{equation}
(5)
where \(W_{a}\) is a trainable matrix in the attention layer, which shows the similarity strength for attention. Moreover, each attention weight \(a_{t}\) is multiplied by the source hidden state h of the encoder to create a new weighted combination context vector \(\widetilde{h}_{t}\) as formulated in Equation (6):
\begin{equation} \widetilde{h}_{t} = a_{t}\cdot h. \end{equation}
(6)
Afterward, the concatenation of the vector \(\widetilde{h}_{t}\) and the decoder output \(s_{t}\) at time t is fed into a linear layer to predict the next word according to Equation (7). In this way, the result \(\widetilde{h}_{t}\) includes the information about a particular section of the input sequence, and thus facilitates the decoder’s prediction of the correct output words:
\begin{equation} {s}^{\prime }_{t} = tanh(W_{\widetilde{h}}[\widetilde{h}_{t};s_{t}]), \end{equation}
(7)
where \(W_{\widetilde{h}}\) is a trainable matrix, and \({s}^{\prime }_{t}\) is used to predict the next word.
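As a concrete illustration, a minimal PyTorch sketch of global attention with the general score function (Equations (5) to (7)) follows; the shapes and names are assumptions, not the exact implementation used here.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)       # W_a in Eq. (5)
        self.W_h = nn.Linear(2 * hidden_size, hidden_size, bias=False)   # W_h~ in Eq. (7)

    def forward(self, s_t, encoder_states):
        # s_t: (batch, hidden) decoder state; encoder_states: (batch, n, hidden)
        scores = torch.bmm(self.W_a(encoder_states), s_t.unsqueeze(2)).squeeze(2)
        a_t = torch.softmax(scores, dim=-1)                                # Eq. (5)
        h_tilde = torch.bmm(a_t.unsqueeze(1), encoder_states).squeeze(1)   # Eq. (6)
        s_prime = torch.tanh(self.W_h(torch.cat([h_tilde, s_t], dim=-1)))  # Eq. (7)
        return s_prime, a_t
```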

3.3 Transformer-based Sequence-to-Sequence Model

The Transformer follows the overall architecture of a sequence-to-sequence model, using stacked self-attention and point-wise, feed-forward neural networks for both the encoder and the decoder. The Transformer relies entirely on attention mechanisms and captures long-distance dependencies well. Moreover, the Transformer has proved superior in quality for many sequence-to-sequence tasks, such as machine translation, reading comprehension, summarization, and language understanding.
In more detail, the encoder is broken down into two sub-layers. The first is a multi-head self-attention layer, and the second is a simple feed-forward neural network. The self-attention layer helps the encoder take other words in the input sentence into account while encoding each word. The outputs of the self-attention layer are fed into the feed-forward neural network. The decoder has both of those layers, with an encoder-decoder attention layer between them that helps the decoder focus on corresponding parts of the input sentence. The multi-head self-attention mechanism consists of several scaled dot-product attention layers running in parallel. Given a matrix of n query vectors \(Q\in R^{n\times d}\), key vectors \(K\in R^{n\times d}\), and value vectors \(V\in R^{n\times d}\), scaled dot-product attention computes the attention scores by Equation (8):
\begin{equation} Attention(Q, K, V)=softmax\left(\frac{QK^{T}}{\sqrt {d}}\right)V, \end{equation}
(8)
where d is the number of hidden units. The multi-head attention allows the model to jointly attend to information from different representation subspaces in different positions. The formulation is represented by Equations (9) and (10):
\begin{equation} MultiHead(Q, K, V)=Concat(head_{1}, \ldots , head_{h})W^{O}, \end{equation}
(9)
\begin{equation} head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}), \end{equation}
(10)
where h is the number of parallel attention layers, or heads. The projections are parameter matrices \(W_{i}^{Q}\in R^{d\times d/h}\), \(W_{i}^{K}\in R^{d\times d/h}\), \(W_{i}^{V}\in R^{d\times d/h}\), and \(W^{O}\in R^{d\times d}\).
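A minimal sketch of scaled dot-product and multi-head attention (Equations (8) to (10)) in PyTorch is shown below; in practice, torch.nn.MultiheadAttention provides an equivalent building block, and the code here is purely illustrative.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, heads, n, d_head); Eq. (8)
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)
    return torch.matmul(torch.softmax(scores, dim=-1), V)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)  # packs all W_i^Q projections
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # W^O in Eq. (9)

    def forward(self, x):
        b, n, _ = x.shape
        def split(t):  # (b, n, d_model) -> (b, heads, n, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        heads = scaled_dot_product_attention(
            split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x)))  # Eq. (10)
        concat = heads.transpose(1, 2).contiguous().view(b, n, -1)
        return self.W_o(concat)                                          # Eq. (9)
```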

3.4 Conditional Variational Autoencoder

The CVAE [35] is among the most representative recent deep generative models. The goal of the CVAE is to first compress the input x into a smaller probability distribution z, and then to transform z back into an approximation of the original input x conditioned on an extra attribute. Because the structure of the CVAE is similar to an encoder-decoder, a sequence-to-sequence model can be easily extended to a CVAE. Moreover, the CVAE has shown great success in generating diverse and appropriate dialogue responses [42]. In the field of dialogue generation, the target response \(W^{r}\) is described as x, and the extra attribute is composed of the context vector c of the source texts \(W^{h}\). Mathematically, the CVAE is trained by maximizing a lower bound on the conditional likelihood of \(W^{r}\) given \(W^{h}\), which is defined in Equation (11):
\begin{equation} p(W^{r}|c)=\int p(W^{r}|z,c)p(z|c)dz. \end{equation}
(11)
z is the latent variable that is sampled from a prior network \(p(z|c)\), and then the response decoder \(p(W^{r}|z,c)\) reconstructs \(W^{r}\) based on the samples z and context attribute c. A recognition network \(q(z|W^{r},c)\) is introduced to approximate the true posterior distribution \(p(z|W^{r},c)\) by taking the target response \(W^{r}\) and context c, and will be absent in the testing stage. The variational lower bound can be written as Equation (12):
\begin{equation} L_{CVAE}=-KL(q(z|W^{r},c)||p(z|c))+E_{q(z|W^{r},c)}(logp(W^{r}|z,c)). \end{equation}
(12)
In the training stage, the latent variable sample z drawn by the recognition network is passed to the response decoder. Also, z is trained to approximate the latent variable \({z}^{\prime }\) drawn by the prior network. In the testing stage, for the purpose of generating responses without the knowledge of the target responses, the latent variable \({z}^{\prime }\) drawn by the prior network is passed to the response decoder.
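For concreteness, the following PyTorch sketch shows the prior network \(p(z|c)\), the recognition network \(q(z|W^{r},c)\), reparameterized sampling, and the KL term of the lower bound in Equation (12); the layer sizes and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LatentNetwork(nn.Module):
    """Maps a conditioning vector to the mean and log-variance of a Gaussian over z."""
    def __init__(self, in_size, latent_size):
        super().__init__()
        self.mu = nn.Linear(in_size, latent_size)
        self.logvar = nn.Linear(in_size, latent_size)

    def forward(self, cond):
        return self.mu(cond), self.logvar(cond)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps sampling differentiable
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL(q(z|W^r, c) || p(z|c)) for diagonal Gaussians, summed over latent dims
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)
```

At training time, the recognition network conditions on both the response and the context, the prior network conditions only on the context, and the decoder reconstructs \(W^{r}\) from the sampled z and c.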

3.5 Generative Adversarial Nets

GANs, which were proposed by Goodfellow et al. [9], are an exciting recent innovation in machine learning. The term "generative" in the name generative adversarial nets describes a class of statistical models that contrasts with discriminative models. The original GAN framework contains a generative model G, which learns to generate plausible data; the generated instances become negative training examples for a discriminative model D, which learns to distinguish the generated data from real data and penalizes the generative model for producing implausible results. During the training process, the generative model of a GAN learns to create fake data that can fool D into classifying its output as real by incorporating feedback from the discriminative model, while the discriminative model of a GAN learns to correctly classify both real items coming from the training dataset and generated items coming from G.
The existing research by Li et al. [24] demonstrates that an adversarially trained system generates highly realistic responses and improves the quality of text generation. In this adversarial dialogue generation model, the generative model G takes the form of a standard sequence-to-sequence model and defines the policy that generates a response \(W^{r}\) given the dialogue history \(W^{h}\). The discriminative model D is a binary classifier that takes the dialogue history \(W^{h}\) and a dialogue response \(W^{r}\) as input and outputs a label \(D(W^{h}, W^{r})\) indicating whether the dialogue response \(W^{r}\) was generated by a machine or a human being. More specifically, the goal of the generative model is to maximize the expected reward \(D(W^{h}, W^{r})\) of the generated responses using Equation (13), following the reinforcement learning approach of Williams et al. [39]:
\begin{equation} J=E_{W^{r}\sim G_{\Theta }(.|W^{h})}[D(W^{r}, W^{h})]. \end{equation}
(13)
Nevertheless, one primary challenge in the GANs framework is that there is no control over the modes of data to be generated. As a result, Mirza and Osindero [27] proposed the conditional adversarial nets (CGANs), which change the original GANs framework by adding an additional parameter as extra information to the generator in the hope that the corresponding item is generated according to this extra information. Kong et al. [17] have successfully applied CGANs to the sentiment-controlled dialogue generation model. Similar to the training algorithm of GANs, the objective of the generative model in CGANs is to maximize the expected reward of generated responses. It can be written as Equation (14):
\begin{equation} J=E_{W^{r}\sim G_{\Theta }(.|W^{h}, label)}[D(W^{r}, W^{h}, label)]. \end{equation}
(14)
Here, label is an additional parameter that controls the output and guides the generative model in what to generate. \(D(W^{r}, W^{h}, label)\) is the probability that the response \(W^{r}\) was produced by a human rather than a machine, given \(W^{h}\) and the corresponding label. It should be noted that the discriminative model evaluates not only the similarity between the generated data and the original data but also the correspondence of the generated data to its input label. In this way, the extra information carried by the label parameter is incorporated both into the data to be generated and into the discriminative model's input, so generation is no longer completely unconditioned. Thus, CGANs help the generator produce sentiment-controlled dialogue responses.
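The generator objective in Equation (14) can be optimized with the same policy-gradient (REINFORCE-style) estimator used for Equation (13). The sketch below shows one such update step; the generator and discriminator interfaces (e.g., sample_with_log_probs) are hypothetical placeholders, not an existing API.

```python
import torch

def generator_reward_loss(generator, discriminator, history, label):
    # sample a response and the summed log-probability of its tokens under G
    response, log_probs = generator.sample_with_log_probs(history, label)
    with torch.no_grad():
        reward = discriminator(response, history, label)   # D(W^r, W^h, label)
    # maximizing E[D(...)] corresponds to minimizing this REINFORCE surrogate
    return -(reward * log_probs).mean()
```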

4 Model Architecture

This research aims to design a dialogue system with emotional conversation generation. We first introduce the problem definition of our model in Section 4.1 and describe how we enhance the emotional sequence-to-sequence model with stacked encoders in Section 4.2. Subsequently, in Section 4.3, we present the structure of conditional variational autoencoder for dialogue generation. To further improve the quality of generated responses, we combine adversarial training methods on top of the CVAE-Seq2Seq model for our task and the details are presented in Section 4.4.

4.1 Problem Definition

The main objective of this research is to generate semantically reasonable and emotionally appropriate responses \(W^{r}=(w^{r}_{1},w^{r}_{2}, \ldots ,w^{r}_{m})\) according to the dialogue history \(W^{h}=(w^{h}_{1},w^{h}_{2}, \ldots ,w^{h}_{n})\) and the specified emotion label y. The response can be formulated as the conditional probability, which is given by Equation (15):
\begin{equation} P(W^{r}|W^{h},y)=\prod ^{m}_{t=1}P(w^{r}_{t}|w^{r}_{[0:t-1]},W^{h},y). \end{equation}
(15)
An overview of our model is given in Figure 1. We investigate our task by constructing the sequence-to-sequence based on the dialogue generation model with the stacked encoders, the GRU-based decoder, the CVAE technique, the content discriminator, and the emotion classifier.
Fig. 1.
Fig. 1. System overview.

4.2 Emotional Sequence-to-Sequence Model with Stacked Encoders

To explicitly handle the emotion of the generated response, we enhance the structure of the standard sequence-to-sequence model. First, the stacked encoders in our model are composed of the encoder of the Transformer proposed in Reference [37] and a standard GRU-based encoder. Second, we concatenate the corresponding emotion label y to the output of the encoder, so that the decoder can use the information of the specified emotion to generate semantically reasonable and emotionally appropriate responses.
The structure of the stacked encoders is illustrated in Figure 2. To learn richer representations of the input sequence and better capture long-term dependencies, we use the Transformer-based encoder as the first encoding layer of the stacked encoders, as in Equation (16):
\begin{equation} E^{out} = TransformerEncoder(W^{h}). \end{equation}
(16)
Fig. 2.
Fig. 2. Architecture of the stacked encoders.
\(E^{out}\) consists of \(E^{out}_{1}, E^{out}_{2}, \ldots , E^{out}_{n}\), where n is the length of the source sequence. Because the Transformer-based encoder can attend to other words in the input sequence as it encodes a specific word, the architecture can highlight the prominent parts of an input sentence as well as its syntactic and semantic properties. The output of the Transformer \(E^{out}\) then becomes the input of the second encoding layer. The GRU-based encoder is given by Equation (17), which is similar to Equation (1) except that the input is different:
\begin{equation} h_{t}=GRU_{enc}(h_{t-1}, E^{out}_{t}), t=1, \ldots , n. \end{equation}
(17)
To generate the emotion vector \(v_{y}\), we compute the embedding vector of the emotion label y in the same way as word embeddings. The emotion embedding is then fed into a fully connected neural network to produce a smaller vector, the emotion vector \(v_{y}\). Afterward, the output of the second encoding layer of the stacked encoders, taken as the context vector c (generated in the same way as in Equation (2)), is concatenated with the emotion vector \(v_{y}\) into a vector called the emotion context vector \(v_{c}\). Because the vector c represents the information of the input sentence and the vector \(v_{y}\) represents the specified emotion expression, the vector \(v_{c}\) represents not only appropriate content but also appropriate emotion. Ultimately, the vector \(v_{c}\) is fed to a decoder composed of GRU cells. A response is then generated from the decoder using the attention mechanism with the general score function [26], as in Equations (5) to (7).
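The following PyTorch sketch illustrates the stacked encoders and the construction of the emotion context vector \(v_{c}\) (Equations (16) and (17)); the dimensions follow Section 5.4.2, but the module names are illustrative and this is not the exact implementation.

```python
import torch
import torch.nn as nn

class StackedEncoder(nn.Module):
    def __init__(self, vocab_size, n_emotions, d_model=128, n_heads=8,
                 d_ff=200, n_layers=3, d_emotion=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)        # Eq. (16)
        self.gru = nn.GRU(d_model, d_model, batch_first=True,
                          bidirectional=True)                             # Eq. (17)
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.emotion_fc = nn.Linear(d_model, d_emotion)   # small emotion vector v_y

    def forward(self, history_ids, emotion_label):
        e_out = self.transformer(self.embedding(history_ids))     # E^out
        _, h_n = self.gru(e_out)
        c = torch.cat([h_n[0], h_n[1]], dim=-1)   # context vector c (both GRU directions)
        v_y = self.emotion_fc(self.emotion_emb(emotion_label))    # emotion vector v_y
        v_c = torch.cat([c, v_y], dim=-1)          # emotion context vector v_c
        return v_c
```

With these settings, \(v_{c}\) has dimension 128 \(\times\) 2 \(+\) 12 \(=\) 268 and initializes the GRU-based decoder.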

4.3 Conditional Variational Autoencoder for Dialogue Generation

Because they have similar encoder-decoder structures, the sequence-to-sequence model can be easily extended to the CVAE model [35]. We follow the model structure described in Zhou and Wang [45] and Kong et al. [17] to build the CVAE-Seq2Seq model. Unlike Equation (11), the goal here is to maximize the lower bound on the conditional likelihood of \(W^{r}\) given the emotion context vector \(v_{c}\), as listed in Equation (18):
\begin{equation} p(W^{r}|v_{c})=\int p(W^{r}|z,v_{c})p(z|v_{c})dz, \end{equation}
(18)
where z is the latent variable and \(v_{c}\) is the emotion context vector, composed of the context vector c and the emotion vector \(v_{y}\) described in Section 4.2. Figure 1 illustrates the structure of the CVAE-Seq2Seq model. The structure of the response encoder is the same as that of the context encoder, which is the stacked encoder described in Section 4.2, but the two encoders have separate parameters. The context encoder takes the embeddings of the dialogue history as input, whereas the response encoder takes the embeddings of the response as input. The variational lower bound can be written as Equation (19):
\begin{equation} L_{CVAE}=-KL(q(z|x,v_{c})||p(z|v_{c}))+E_{q(z|x,v_{c})}(logp(x|z,v_{c})), \end{equation}
(19)
where the recognition network, formulated as \(q(z|x,v_{c})\), takes the emotion context vector \(v_{c}\) and the vector x, which represents the response \(W^{r}\), as input and encodes them into a latent distribution z. The prior network, formulated as \(p(z|v_{c})\), takes only \(v_{c}\) as input and outputs a latent distribution \({z}^{\prime }\). We adopt two fully connected layers to build the recognition network and the prior network, using the ReLU activation function in the first fully connected layer and the Tanh activation function in the second. We then use the reparameterization trick [16] to obtain latent variables. At the training stage, z drawn by the recognition network is trained to approximate \({z}^{\prime }\) drawn by the prior network, and the decoder takes the concatenation of the vector \(v_{c}\) and the sampled stochastic latent variable z to reconstruct the original response x. At the testing stage, the target responses are absent, so the concatenation of the vector \(v_{c}\) and \({z}^{\prime }\) drawn by the prior network is fed into the decoder. The detailed process of the testing stage is shown in Figure 3. Furthermore, to handle the vanishing latent variable problem [2], which causes the GRU-based decoder to fail to encode meaningful information in z, we use the techniques of KL annealing [2] and bag-of-words loss [42]. Zhao et al. [42] have shown that using both KL annealing and bag-of-words loss works well against the vanishing latent variable problem and also yields lower perplexity and higher KL cost. Therefore, the final objective function for the CVAE-Seq2Seq can be written as Equation (20):
\begin{equation} {L}^{\prime }=L_{CVAE}+L_{bow}. \end{equation}
(20)
Fig. 3.
Fig. 3. Testing stage of our model.
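For reference, a sketch of the resulting training loss (Equations (19) and (20)) with KL annealing and the bag-of-words term is given below; the tensors recon_logits and bow_logits are assumed to be produced by the decoder and by an auxiliary bag-of-words predictor, respectively, and the code is illustrative rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def cvae_seq2seq_loss(recon_logits, target_ids, mu_q, logvar_q, mu_p, logvar_p,
                      bow_logits, kl_weight, pad_id=0):
    # reconstruction term E_q[log p(x | z, v_c)], as token-level cross-entropy
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_ids,
                            ignore_index=pad_id, reduction="mean")
    # KL(q(z|x, v_c) || p(z|v_c)); kl_weight is annealed from 0 to 1 (KL annealing)
    kl = 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1).mean()
    # bag-of-words loss: predict every target word from z and v_c, ignoring word order
    mask = (target_ids != pad_id).float()
    bow_log_probs = F.log_softmax(bow_logits, dim=-1)           # (batch, vocab)
    bow = -(bow_log_probs.gather(1, target_ids) * mask).sum(1).mean()
    return recon + kl_weight * kl + bow                          # Eq. (20)
```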

4.4 Conditional Generative Adversarial Nets for Dialogue Generation

To further generate higher-quality responses and to control the emotions of text generation more explicitly, we combine adversarial training methods on top of the CVAE-Seq2Seq model for our task. Our proposed framework consists of three components: a generative model G, a content discriminative model D, and an emotion classifier C. The detailed descriptions of these three components are provided in the following sections.

4.4.1 CGAN Framework.

The Generative Model: We adopt both the stacked encoders and the decoder as the generative model, which defines the policy that generates a response \(W^{r}\) given the dialogue history \(W^{h}\) and a specified emotion label y. The generative model aims to learn to create semantically reasonable responses by incorporating feedback from the content discriminator, and to fool the content discriminator into classifying the outputs of the generative model as real.
The Content Discriminator: The content discriminator in our framework is simply a binary classifier that takes the dialogue history \(W^{h}\), the specified emotion label y, and the response \(W^{r}\) as input and outputs a result indicating whether the input response was generated by a human being or a machine. The detailed process of the content discriminator module is illustrated in Figure 4. Our design of the content discriminator, which is inspired by Kong et al. [17], consists of two encoders, \(Encoder_{C}\) and \(Encoder_{R}\). Both encoders are GRU-based; however, their inputs are different. \(Encoder_{C}\) takes the dialogue history \(W^{h}\) as input and outputs a representation vector c, and this representation vector c is then concatenated with the emotion vector \(v_{y}\) to compose the emotion context vector \(v_{c}\). The hidden state of \(Encoder_{R}\) is initialized with the emotion context vector \(v_{c}\); the purpose of this initialization is to provide \(Encoder_{R}\) with the information of the dialogue history and the emotion label. Subsequently, \(Encoder_{R}\) takes the target response \(W^{r}\) (namely, the human reference) or the generated response (namely, the fake response) as input and outputs a representation vector \({v}^{\prime }_{c}\). Eventually, the concatenation of the representation vector \({v}^{\prime }_{c}\) and the vector \(v_{c}\) is fed into a fully connected neural network-based binary classifier to predict the probability of the input response being a machine-generated or a human-generated dialogue. The emotion context vector \(v_c\) is used once more here so that the content discriminator pays more attention to the emotion information (an illustrative sketch of this module is given after the figure captions below).
The Emotion Classifier: The detailed process of the emotion classifier module is illustrated in Figure 5. To control the emotion of our generative model more explicitly, we adopt a CNN-based emotion classifier to distinguish whether the generated response \(W^{r}_{g}\) matches the given emotion label y. The emotion classifier is implemented with a convolution operation, the ReLU activation function, max-pooling, and a linear layer. The emotion classifier takes the target response \(W^{r}\) (namely, the human reference) or the generated response \(W^{r}_{g}\) (namely, the fake response) as input, and outputs a K-dimensional vector, where K is the number of emotion categories.
Fig. 4.
Fig. 4. Architecture of the content discriminator.
Fig. 5.
Fig. 5. Architecture of the emotion classifier.
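A minimal PyTorch sketch of the content discriminator in Figure 4 is given below; the sizes follow Section 5.4.2 where possible, and all names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ContentDiscriminator(nn.Module):
    def __init__(self, vocab_size, n_emotions, emb_size=128,
                 hidden_size=128, d_emotion=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.encoder_c = nn.GRU(emb_size, hidden_size, batch_first=True)
        # Encoder_R's hidden size matches v_c so that v_c can initialize it
        self.encoder_r = nn.GRU(emb_size, hidden_size + d_emotion, batch_first=True)
        self.emotion_emb = nn.Embedding(n_emotions, d_emotion)
        self.classifier = nn.Sequential(
            nn.Linear(2 * (hidden_size + d_emotion), hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1))   # probability that the response is human-written

    def forward(self, history_ids, response_ids, emotion_label):
        _, c = self.encoder_c(self.embedding(history_ids))
        v_c = torch.cat([c[-1], self.emotion_emb(emotion_label)], dim=-1)
        h0 = v_c.unsqueeze(0).contiguous()              # initialize Encoder_R with v_c
        _, v_c_prime = self.encoder_r(self.embedding(response_ids), h0)
        logits = self.classifier(torch.cat([v_c_prime[-1], v_c], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)
```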

4.4.2 Adversarial Training for the Generative Model.

In the adversarial training, we first pre-trained the generative model without the content discriminative model and the emotion classifier, and then we kept the parameters of the pre-trained generative model fixed to pre-train the content discriminative model and the emotion classifier; at each iteration, the parameters of the model not being trained are frozen. In more detail, the generative model first generates a response \(W^{r}_{g}\) from the given dialogue history \(W^{h}\) and the specified emotion label y. Then, the discriminative model tries to distinguish whether the data \((W^{r}, W^{h}, y)\) or \((W^{r}_{g}, W^{h}, y)\) come from the true distribution. The generative model is optimized on the basis of the feedback obtained from the content discriminative model. The emotion classifier tries to approximately characterize the conditional distribution \(P(y|W^{r}_{g}) \approx P(y|W^{r})\). Concretely, in the training stage, the generative model tries to maximize the expected reward of the generated responses using Equation (21), which is similar to Equation (14), where the additional parameter label is replaced by y:
\begin{equation} L_{GD}=E_{W^{r}_{g}\sim G_{\Theta }(.|W^{h}, y)}[D(W^{r}_{g}, W^{h}, y)]. \end{equation}
(21)
Note that \(D(W^{r}_{g}, W^{h}, y)\) can be considered as the probability of the response \(W^{r}_{g}\) being obtained from human beings given the \(W^{h}\) and y. Also, the generative model tries to maximize the following objective function, as listed in Equation (22):
\begin{equation} L_{GC}=E_{W^{r}_{g}\sim G_{\Theta }(.|W^{h}, y)}[P(y|W^{r}_{g})]. \end{equation}
(22)
The goal of our approach is to maximize the following objective function, as listed in Equation (23):
\begin{equation} L={L}^{\prime }+L_{GD}+L_{GC}, \end{equation}
(23)
where \({L}^{\prime }\), \(L_{GD}\), and \(L_{GC}\) are related to the generative model. They indicate, respectively, how well the generated response matches the input training sample, the real response, and the real emotion label. To stabilize the training process, we also added teacher forcing to assist training so that the generative model has access to the real responses.
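For clarity, the following sketch shows one generator update under this scheme (Equations (21) to (23)), with the content discriminator and emotion classifier frozen and their scores used as rewards in a policy-gradient surrogate; all module interfaces here are assumed placeholders, and the losses are written in the usual minimize-the-negative-objective convention.

```python
import torch

def generator_step(generator, discriminator, emo_classifier, optimizer,
                   history, emotion_label, gold_response):
    optimizer.zero_grad()
    # L': CVAE-Seq2Seq loss computed with teacher forcing on the gold response;
    # the generator also returns a sampled response and its token log-probabilities
    cvae_loss, sampled_response, log_probs = generator(history, emotion_label,
                                                       gold_response)
    with torch.no_grad():  # D and C are frozen while G is updated
        d_reward = discriminator(history, sampled_response, emotion_label)   # Eq. (21)
        c_reward = emo_classifier(sampled_response).softmax(-1).gather(
            1, emotion_label.unsqueeze(1)).squeeze(1)                        # Eq. (22)
    # Eq. (23): maximize L' + L_GD + L_GC, written here as a loss to minimize
    loss = cvae_loss - ((d_reward + c_reward) * log_probs).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```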

4.5 Integration with Large Language Models

In this section, we discuss the extension of the emotional dialogue system using LLMs. Inspired by existing models [40, 41], our approach leverages context embedding and emotion embedding in combination with an LLM decoder to generate coherent and contextually relevant responses. Previous studies have demonstrated that incorporating LLMs can significantly enhance the capabilities of language models, improving both fluency and contextual relevance. By integrating LLMs, our model may benefit from the rich pre-trained knowledge embedded within these models, allowing for more sophisticated and accurate dialogue generation. Moreover, LLMs can serve as domain-specific knowledge experts, enabling the generation of dialogues tailored to specific fields or topics. This capability is particularly valuable for creating responses that require specialized knowledge or nuanced understanding.
To further refine the emotional dialogue system, we propose integrating an emotion classifier with LLMs. This integration helps to ensure that the generated dialogues are not only contextually appropriate but also emotionally accurate. The emotion classifier assesses the emotional tone of the generated dialogue, providing feedback that can be used to adjust and enhance the emotional quality of the responses. Additionally, the content discriminator in our model benefits from comprehensive contextual and emotional features provided by the LLM and emotion classifier integration. This enriched feature set allows the content discriminator to more effectively evaluate the quality and relevance of the generated dialogues, leading to overall improvements in the system’s performance. By combining these elements, our extended model aims to produce emotionally nuanced and contextually rich dialogues, advancing the state-of-the-art in emotional dialogue systems. Our future work will focus on further investigating the aforementioned ideas, which have demonstrated potential in advancing the capabilities of emotional dialogue systems.

5 Experiments

5.1 Dataset and Preprocessing

To evaluate our proposed model, we used the dataset of the NII Test Collections for IR Systems short-text conversation task (STC-3) Chinese emotional conversation generation (CECG) subtask [44]. The dataset includes an emotion label for each post and response and contains more than 600,000 Chinese post-response pairs. There are six emotion categories in this dataset, namely, "Like," "Sad," "Disgust," "Angry," "Happy," and "Other." We do not need any segmentation tool to segment the sentences, because every word, emoji, or symbol is separated by a white space in the original dataset. In addition, each emoji in an original sentence is expressed by a pair of square brackets and a word, with each token separated by a white space, such as "[ tear ]." Therefore, we removed the white spaces inside all emojis and regarded them as new words; for example, we converted "[ tear ]" to "[tear]" and regarded "[tear]" as one new word. To simplify the task, we removed all post-response pairs labeled "Other" in the emotion category. To normalize the data, we collapsed duplicated tokens (for example, "! ! ! !" was shortened to "!" and "[tear] [tear] [tear]" was shortened to "[tear]") and retained only those pairs whose utterances had a minimum length of five words. The emotion labels of these data are noisy, because they were obtained by an emotion classifier [44] rather than manually annotated by humans. To handle this problem, we kept only the post-response pairs that contained frequent words for each emotion category, to obtain higher confidence.
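For illustration, the following sketch reproduces these preprocessing steps (merging emoji tokens, collapsing consecutive duplicate tokens, and filtering short utterances); the regular expression and threshold are our own assumptions about how such cleaning can be implemented.

```python
import re

def preprocess(utterance: str, min_len: int = 5):
    # merge bracketed emoji descriptions into one token, e.g. "[ tear ]" -> "[tear]"
    utterance = re.sub(r"\[\s*([^\[\]]+?)\s*\]",
                       lambda m: "[" + m.group(1).replace(" ", "") + "]",
                       utterance)
    tokens = utterance.split()
    # collapse consecutive duplicates, e.g. "! ! ! !" -> "!"
    deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    # keep only utterances with at least min_len tokens
    return deduped if len(deduped) >= min_len else None

print(preprocess("今天 好 开心 [ 哈哈 ] [ 哈哈 ] 啊 ! ! !"))
# -> ['今天', '好', '开心', '[哈哈]', '啊', '!']
```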

5.2 Baselines

In the experiments, we compared our model with the following neural dialogue models, including Seq2Seq, CVAE, and CGAN-CVAE. The details of the baselines are shown as follows.
Seq2Seq: An encoder-decoder neural dialogue model with an attention mechanism similar to Reference [34]. To let the model learn additional emotional information, the RNN-based encoder encodes the dialogue history into a context vector, and the concatenation of the encoded context vector and the emotion vector is used as the initial hidden state of the RNN-based decoder. The way we produce emotion vectors is described in Section 4.2.
CVAE: A neural dialogue model adapted from the conditional variational autoencoder model. As described in Reference [42], it can be readily applied to affect-controlled dialogue generation, where the meta feature is an emotion label in this comparison. As in Reference [42], we also added KL annealing and bag-of-words loss to the CVAE to enhance its performance.
CGAN-CVAE: The CGAN-CVAE model [17] uses an adversarial approach to generate emotion-aware responses. The generator is a conditional variational autoencoder-based Seq2Seq model. The discriminator identifies whether the input response is generated from human beings or machines given the dialogue history and emotion information.

5.3 Evaluation Metrics

To evaluate our framework, we adopt the following evaluation metrics, including perplexity and emotion accuracy. The details of the evaluation metrics are displayed in the following sections.
Perplexity: Liu et al. [25] found that BLEU showed either low or no correlation with human judgements; therefore, BLEU is not appropriate for measuring conversation generation. Instead, we used perplexity to evaluate content-level performance. Perplexity is a popular evaluation metric for many natural language processing tasks and can be computed efficiently. It measures how well a model predicts the ground-truth responses in the test data: since the likelihood appears in the denominator of the perplexity formula, a higher likelihood yields a lower perplexity. A lower perplexity therefore indicates better content-level performance.
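In practice, perplexity can be computed as the exponential of the average per-token negative log-likelihood of the ground-truth responses; a small illustrative helper (not the exact evaluation code used here) is shown below.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> float:
    # logits: (batch, seq, vocab) model outputs; targets: (batch, seq) gold token ids
    nll = F.cross_entropy(logits.transpose(1, 2), targets,
                          ignore_index=pad_id, reduction="mean")
    return math.exp(nll.item())   # lower perplexity corresponds to higher likelihood
```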
Accuracy of Expressed Emotions: The purpose of our task is to build a generative model that can express a reasonable emotion while responding, given a dialogue history and a specific emotion label. Therefore, we built an emotion classifier, which obtained an accuracy of 83.09%. Then, we used the emotion classifier we trained to evaluate emotion accuracy as the consistency between the specified emotion tag and the predicted emotion tag of the generated response. A higher emotion accuracy means the emotion control capacity of a model is better.
Human Evaluation: We performed human evaluation to better evaluate the response quality and the emotion control capacity of our proposed model. We asked 32 undergraduate and graduate students to score dialogue responses generated by the baselines and our model. There are five random items for each emotion category; that is, there are a total of 25 random items in the test set; each of the items contains a dialogue history, an emotion category, a golden response, and generated responses based on each dialogue model. Our evaluation method is inspired by Zhou et al. [44] and Shang et al. [34]. Human annotators are expected to evaluate the responses based on two settings. In the first setting, for judging the quality of dialogue responses from each model, annotators were requested to score a response in terms of Content (rating scale is 0, 1, 2), where score 2 means the generated response is obviously a logically coherent, grammatically fluent, and natural response to the dialogue history; score 1 means the generated response can be an appropriate response in a specific scenario; and score 0 means it is difficult or impossible for the generated response to find a suitable scenario. In the second setting, for judging the emotion expression, annotators were requested to score a response in terms of Emotion (rating scale is 0, 1), where score 1 means the emotion category expressed in the generated response is the same as the given emotion category; and score 0 means the emotion category of the generated response does not match the given emotion category.

5.4 Implementation Details

In this section, we first introduce several emotion classifiers and choose the best performing classifier for emotion accuracy of automated evaluation. Next, we present the details of the default parameters of constructing our model.

5.4.1 Emotion Classifier for Automated Evaluation.

Since we aim to control the emotion of the response given a specified emotion and a dialogue history, it is important that the generated response correctly reflect the emotion. Consequently, we trained several emotion classifiers on the STC-3 CECG dataset [44] and adopted the best classifier for automated evaluation. The emotion classifiers we used include RNN, Bidirectional LSTM (Bi-LSTM) [10], FastText [14], and CNN [15]. The emotion classifiers are trained to distinguish all emotion labels in the original dataset, including "Like," "Sad," "Disgust," "Angry," "Happy," and "Other." In particular, an emotion classifier assigns a generated sentence to the "Other" category if none of the other categories matches. The emotion accuracies are shown in Table 1. As we can see, the Bi-LSTM classifier obtains the best performance on the STC-3 CECG dataset. Thus, we chose Bi-LSTM to evaluate the likelihood that the predicted label matches the given emotion label.
| Method   | Accuracy |
| RNN      | 63.20%   |
| Bi-LSTM  | 83.09%   |
| FastText | 82.66%   |
| CNN      | 80.02%   |

Table 1. Classification Accuracy on the STC-3 CECG Dataset

5.4.2 Our Model.

We used PyTorch to implement the proposed model and methodology. We employed a Seq2Seq model with stacked encoders to build an emotional Seq2Seq model. There are two encoding layers in the stacked encoders, composed of a Transformer-based encoder and a GRU-based encoder. The first encoding layer is a Transformer-based encoder with three Transformer encoder layers, each with eight attention heads, and the dimension of the feed-forward network in the Transformer encoder is 200. The second encoding layer is a one-layer bidirectional GRU with a hidden size of 128 in each direction. The word embedding size is the same as the embedding size of the emotion category, which is set to 128. The dimension of the emotion vector is set to 12. The decoder is a one-layer unidirectional GRU with a hidden size of 268 (128 \(\times\) 2 \(+\) 12). The dimensions of the mean and log variance of the latent variable z are set to 128. Both the prior network and the recognition network are fully connected networks with a hyperbolic tangent activation function at the output layer. At the training stage, the latent variable sampled from the recognition network, which carries the information of the golden responses, is passed to the response decoder. At the testing stage, the latent variable sampled from the prior network, which has no information of the golden responses, is directly fed into the response decoder. The batch size is 32. We used Adam as our optimizer with a learning rate of 1e-3 and gradient clipping at 5.0 to train our model. We ran 30 epochs. The entire experimental environment in this study includes four Intel Core i7-3770 CPUs and Nvidia GeForce GTX 1070 GPUs.
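The optimization settings above can be summarized by the following training-loop sketch; model and train_loader are placeholders for the modules and data pipeline described in this article, and the loop is illustrative only.

```python
import torch

def train(model, train_loader, epochs=30, lr=1e-3, clip=5.0):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)          # combined objective of Eq. (23)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
            optimizer.step()
```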

6 Experimental Results and Analysis

In this section, we analyze and present the results of the experiments of the baselines and our model. Section 6.1 shows the measurement of the change in performance on the perplexity score and emotion accuracy by varying our default model parameters. Section 6.2 shows the evaluation of each dialogue model in terms of the perplexity score and emotion accuracy. Section 6.3 shows the dialogue response quality and emotion accuracy of each dialogue model based on human annotation. Section 6.4 samples some examples generated from each dialogue model.

6.1 Parameters of the Experimental Settings

To measure the importance of different aspects of our model, we varied the parameters of our default model, as shown in Table 2. \(N_{layer}\) is the number of identical layers stacked in the Transformer encoder; \(d_{model}\) is the dimension of the embedding layer and also the dimension of the hidden state in the GRU-based encoder and the GRU-based decoder; \(d_{ff}\) is the dimension of the feed-forward network in the Transformer encoder; \(N_{head}\) is the number of self-attention heads in the Transformer encoder; \(d_{latent}\) is the dimension of the latent variable of the variational autoencoder; finally, \(d_{emotion}\) is the dimension of the emotion vector. In Table 2, rows (A), (B), and (C), we observed that increasing \(N_{layer}\), \(d_{model}\), and \(d_{ff}\) incurred worse PPL. The most obvious difference among them is \(N_{layer}\): our model achieves a PPL score of 29.61 when \(N_{layer}\) is set to 1, but the PPL score increases to 76.03 when \(N_{layer}\) is set to 6. Meanwhile, we observed in row (D) that increasing the number of self-attention heads improves model quality. In row (E), we observed that increasing \(d_{latent}\) helps to avoid over-fitting. We further observed in row (F) that, as expected, increasing \(d_{emotion}\) results in higher emotion accuracy.
Table 2. Parameters of the Experimental Setting in Our Model

          N_layer  d_model  d_ff   N_head  d_latent  d_emotion  PPL    ACC
Default   2        128      512    8       128       12         29.99  69.16%
(A)       1                                                     29.61  59.57%
          3                                                     38.11  69.64%
          6                                                     76.03  68.79%
(B)                64                                           29.94  71.86%
                   256                                          53.39  70.54%
(C)                         256                                 29.46  65.43%
                            1024                                34.34  72.40%
(D)                                1                            31.77  72.91%
                                   4                            30.01  64.35%
(E)                                        64                   43.34  72.50%
                                           256                  20.60  67.88%
(F)                                                  6          30.33  62.56%
                                                     24         32.80  77.56%

Unlisted values are identical to those of the default model.

6.2 Automated Evaluation

The results of the automated evaluation are shown in Table 3. Note that each baseline uses the same parameter dimensions as our model wherever the corresponding parameters exist in the baseline. Compared with the other models, our model achieves the lowest PPL score, which indicates the best response quality, because the PPL score reflects how difficult it is for a model to generate the responses. Our model also obtains the best performance in terms of emotion accuracy, which means that its emotion control is the strongest among the compared models. For emotion accuracy, the results show that CVAE captures emotion more precisely than the standard Seq2Seq model, and CGAN-CVAE performs better than CVAE. For the PPL score, we suspect that, because the Transformer learns a richer representation of the input sequence and captures long-term dependencies well, combining the Transformer encoder layer and the GRU-based encoder layer in the Seq2Seq framework produces more reasonable responses.
Table 3. Objective Evaluation with Perplexity and Accuracy

Method      Perplexity  Emotion Accuracy
Seq2Seq     81.93       60.25%
CVAE        39.66       67.40%
CGAN-CVAE   38.84       69.05%
Ours        29.99       69.16%
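For clarity, the two automated metrics can be computed as sketched below: perplexity is the exponential of the average per-token cross-entropy of the model on the test responses, and emotion accuracy is the proportion of generated responses whose emotion, as predicted by the Bi-LSTM classifier of Section 5.4.1, matches the specified emotion. The helper names in this sketch are our own assumptions.

```python
# Minimal sketch of the two automated metrics (assumed helper names).
import math
import torch
import torch.nn.functional as F

def perplexity(logits, targets, pad_id=0):
    """logits: (batch, seq_len, vocab), targets: (batch, seq_len)."""
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_id, reduction="mean")   # average per-token negative log-likelihood
    return math.exp(loss.item())                 # PPL = exp(mean cross-entropy)

def emotion_accuracy(emotion_classifier, generated_ids, specified_emotions):
    """Fraction of generated responses whose predicted emotion matches the given label."""
    with torch.no_grad():
        predicted = emotion_classifier(generated_ids).argmax(dim=-1)
    return (predicted == specified_emotions).float().mean().item()
```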
To further examine the ability of our model to generate responses with given emotion categories, we visualize the emotion interaction patterns between the target emotions and the predicted emotions, as shown in Figure 6. We chose a confusion matrix to display these results because it is easy to interpret. We can clearly observe a dark-colored diagonal in Figure 6, which indicates that, given a specified emotion, the emotion categories expressed in the generated responses most closely match the target emotion. We can also observe that our model tends to generate responses with the emotion "Happy" when it is instructed to express "Like." We suspect that this is because some frequent words, including emoji codes such as "[Chuckle]," "[hee hee]," and "cheerful," are shared by "Happy" and "Like." In addition, the probabilities of the emotion "Disgust" (the third column) are high in every emotion category (each row), because the frequent words of the "Disgust" category overlap with those of the other categories.
Fig. 6. Confusion matrix of our model for emotion interaction.
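The matrix in Figure 6 can be obtained by counting, for each specified emotion, the emotions that the Bi-LSTM classifier assigns to the generated responses and normalizing each row. A minimal sketch follows; the label order and helper names are assumptions rather than the exact implementation.

```python
# Sketch of the emotion-interaction matrix: rows are specified (target) emotions,
# columns are the emotions predicted for the generated responses (assumed names).
import numpy as np

EMOTIONS = ["Like", "Sad", "Disgust", "Angry", "Happy", "Other"]  # assumed order

def emotion_confusion_matrix(target_emotions, predicted_emotions, n=len(EMOTIONS)):
    matrix = np.zeros((n, n), dtype=np.float64)
    for target, predicted in zip(target_emotions, predicted_emotions):
        matrix[target, predicted] += 1
    # Normalize each row so entries are the probabilities of expressing each
    # emotion when a given emotion is specified.
    row_sums = matrix.sum(axis=1, keepdims=True)
    return matrix / np.maximum(row_sums, 1.0)
```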

6.3 Human Evaluation

Table 4 presents the Content score and the Emotion score for each emotion category based on the different dialogue models. As shown in the "Overall" column of Table 4, our model outperforms the baselines in both Content and Emotion, which means that incorporating the Transformer encoder into the Seq2Seq framework and adopting both the CVAE and CGAN techniques improve content quality and emotion accuracy. We can also observe that the scores for the categories "Disgust" and "Happy" are the highest, which suggests that our model better understands the human emotions of "Disgust" and "Happy." The same tendency can be observed in Figure 6.
Table 4. Results of Human Evaluation Based on Different Dialogue Models

Method       Overall        Like           Sad            Disgust        Angry          Happy
             Cont.   Emot.  Cont.   Emot.  Cont.   Emot.  Cont.   Emot.  Cont.   Emot.  Cont.   Emot.
Seq2Seq      0.90    0.43   0.82    0.45   0.86    0.35   0.99    0.51   0.80    0.27   1.03    0.58
CVAE         0.79    0.39   0.95    0.48   0.82    0.38   0.78    0.37   0.53    0.26   0.85    0.48
CGAN-CVAE    0.98    0.48   0.69    0.40   0.89    0.37   1.25    0.60   0.88    0.39   1.20    0.63
Ours         1.31    0.69   1.27    0.72   1.28    0.66   1.35    0.70   1.21    0.54   1.47    0.85

Cont. = Content; Emot. = Emotion.

6.4 Case Study

For all models, we used greedy decoding to generate responses (a minimal decoding sketch is given after Table 5). Table 5 presents examples of responses generated for a specified emotion category and a dialogue history extracted from the test set. We can observe that the Seq2Seq model tends to repeat words and to generate frequent words in the output sentence; for example, "我也是" ("me too") is repeated. In contrast to the Seq2Seq model, the CVAE model generates diverse responses. However, some responses generated by the CVAE model are so diverse that they are no longer suitable for the source sequences. We can also observe that the CGAN-CVAE model compensates for this shortcoming of the CVAE model owing to the GAN structure: its responses are more stable than those of the CVAE model. Finally, the responses generated by our model are more relevant to both the source sequence and the target emotion than those of the existing methods.
Table 5. Examples with a Corresponding Emotion Category from the STC-3 CECG Dataset

            Original text                       Translation
Context     感冒,啊…,烦死了                     I got a cold, ah..., annoying
Emotion     噁心                                Disgust
Seq2Seq     你这是怎么了?                      What's wrong with you?
CVAE        你又感冒了?[挖鼻屎]                Did you catch a cold again? [digging booger]
CGAN-CVAE   怎么了,好好休息                    What happened, have a good rest
Our Model   你干什么坏事了,睡觉吧              What bad thing did you do, go to sleep

Context     我会告诉你我哭了吗                  Would I tell you that I cried?
Emotion     伤心                                Sad
Seq2Seq     我也是, 我也是                      Me too, me too
CVAE        你哭哭了, 呵呵                      You cried, hehe
CGAN-CVAE   我也想哭的事?                      The thing that I also want to cry about?
Our Model   你又不想我。[心]                    You do not miss me. [heart]

Context     真的感动了, 谢谢你们。              Really touched, thank you.
Emotion     快乐                                Happy
Seq2Seq     你是不是很开心吗?                  Aren't you happy?
CVAE        生日快乐? 生日快乐[蛋糕]            Happy birthday? Happy birthday [cake]
CGAN-CVAE   你生日快乐啊亲[蛋糕]                Happy birthday to you, dear [cake]
Our Model   你生日快乐! 开心就好了[嘻嘻]        Happy birthday to you! Just be happy [hee hee]

Context     我们一家人没救了。                  Everyone in my family is hopeless.
Emotion     生气                                Angry
Seq2Seq     怎么回事?                          What's wrong?
CVAE        什么?又来了?                      What? Again?
CGAN-CVAE   你怎么了?哼!                      What's wrong with you? Humph!
Our Model   怎么啦?谁欺负你                    What's up? Who bullied you?

Context     想去旅行结婚,这个想法怎么样?      I want to have a wedding tour, how about this idea?
Emotion     喜欢                                Like
Seq2Seq     我也想去                            I want to go, too
CVAE        你是想去的嘛。                      So you do want to go.
CGAN-CVAE   那你要好好珍惜!                    Then you should cherish it!
Our Model   我陪你一块快乐。                    I am happy to accompany you.
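A minimal sketch of the greedy decoding procedure used to produce the examples above is given below. It assumes a simplified GRU decoder interface and omits the attention over encoder outputs, so it should be read as an illustration rather than the exact implementation; the module arguments and dimensions are our own assumptions.

```python
# Minimal greedy-decoding sketch (assumed, simplified decoder interface).
# At each step the most probable token is chosen until EOS or max_len.
import torch
import torch.nn as nn

def greedy_decode(embedding: nn.Embedding, gru_cell: nn.GRUCell, out_proj: nn.Linear,
                  init_hidden: torch.Tensor, bos_id: int, eos_id: int, max_len: int = 30):
    """init_hidden: (1, 268) context vector concatenated with the emotion vector."""
    token = torch.tensor([bos_id])
    hidden = init_hidden
    response = []
    for _ in range(max_len):
        hidden = gru_cell(embedding(token), hidden)   # one GRU decoder step
        logits = out_proj(hidden)                     # (1, vocab_size)
        token = logits.argmax(dim=-1)                 # greedy choice of the next token
        if token.item() == eos_id:
            break
        response.append(token.item())
    return response
```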

7 Conclusions

We proposed a novel sequence-to-sequence model with stacked encoders, which combines the characteristics of a Transformer-based encoder and a GRU-based encoder in our stacked encoder. We also adopted a conditional variational autoencoder and conditional generative adversarial networks in our framework. Both the content discriminator and the emotion classifier are utilized to force the decoder to explicitly control the emotion of the generated response given the specified emotion category. We used a Chinese emotional dataset to evaluate our proposed model and the baseline approaches, and we performed automated evaluation and human evaluation to measure the quality of the generated responses. In the automated evaluation, we adopted the PPL score and emotion accuracy. Experiments demonstrate that our approach outperforms Seq2Seq, CVAE, and CGAN-CVAE, and is capable of generating appropriate responses in terms of both content quality and emotion expression.

References

[1]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15), Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http://arxiv.org/abs/1409.0473
[2]
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, Berlin, 10–21.
[3]
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS Workshop on Deep Learning.
[4]
Hao Fei, Donghong Ji, Yue Zhang, and Yafeng Ren. 2020. Topic-enhanced capsule network for multi-label emotion classification. IEEE/ACM Trans. Audio, Speech Lang. Proc. 28 (June 2020), 1839–1848.
[5]
Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. 2023. Reasoning Implicit Sentiment with Chain-of-Thought Prompting. Retrieved from https://arxiv.org/abs/2305.11255
[6]
Hao Fei, Fei Li, Chenliang Li, Shengqiong Wu, Jingye Li, and Donghong Ji. 2022. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI’22), Lud De Raedt (Ed.). International Joint Conferences on Artificial Intelligence Organization, 4121–4128. Main Track.
[7]
Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. 2023. Nonautoregressive encoder–Decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Trans. Neural Netw. Learn. Syst. 34, 9 (2023), 5544–5556.
[8]
Hao Fei, Yue Zhang, Yafeng Ren, and Donghong Ji. 2020. Latent emotion memory for multi-label emotion classification. Proceedings of the AAAI Conference on Artificial Intelligence. 7692–7699.
[9]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). MIT Press, Cambridge, MA, 2672–2680.
[10]
Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications (ICANN’05). Springer-Verlag, Berlin, 799–804.
[11]
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1631–1640.
[12]
Jonathan Herzig, Michal Shmueli-Scheuer, Tommy Sandbank, and David Konopnicki. 2017. Neural response generation for customer service based on personality traits. In Proceedings of the 10th International Conference on Natural Language Generation. Association for Computational Linguistics, 252–256.
[13]
Chenyang Huang, Osmar Zaïane, Amine Trabelsi, and Nouha Dziri. 2018. Automatic dialogue generation with expressed emotions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 49–54.
[14]
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 427–431. https://www.aclweb.org/anthology/E17-2068
[15]
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–1751.
[16]
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of 2nd International Conference on Learning Representations (ICLR’14), Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http://arxiv.org/abs/1312.6114
[17]
Xiang Kong, Bohan Li, Graham Neubig, Eduard Hovy, and Yiming Yang. 2019. An adversarial approach to high-quality, sentiment-controlled neural dialogue generation. In Proceedings of AAAI Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL’19). Retrieved from https://arxiv.org/abs/1901.07129.
[18]
Jana Lüdtke and Arthur Jacobs. 2015. The emotion potential of simple sentences: Additive or interactive effects of nouns and adjectives? Front. Psychol. 6 (2015), 1137.
[19]
Bobo Li, Hao Fei, Fei Li, Yuhan Wu, Jinsong Zhang, Shengqiong Wu, Jingye Li, Yijiang Liu, Lizi Liao, Tat-Seng Chua, and Donghong Ji. 2023. DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis. Retrieved from https://arxiv.org/abs/2211.05705
[20]
Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, and Fei Li. 2023. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia (MM’23). Association for Computing Machinery, New York, NY, 5923–5934.
[21]
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 110–119.
[22]
Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 994–1003.
[23]
Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1192–1202.
[24]
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2157–2169.
[25]
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2122–2132.
[26]
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1412–1421.
[27]
Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. Retrieved from http://arxiv.org/abs/1411.1784.
[28]
Yehong Peng, Yizhen Fang, Zhiwen Xie, and Guangyou Zhou. 2019. Topic-enhanced emotional conversation generation with attention mechanism. Knowl.-Based Syst. 163 (2019), 429–437.
[29]
Rosalind W. Picard. 1997. Affective Computing. MIT Press, Cambridge, MA.
[30]
Helmut Prendinger and Mitsuru Ishizuka. 2005. The empathic companion: A character-based interface that addresses users’ affective states. Appl. Artific. Intell. 19, 3–4 (Mar. 2005), 267–285.
[31]
Helmut Prendinger, Junichiro Mori, and Mitsuru Ishizuka. 2005. Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. Int. J. Hum.-Comput. Stud. 62, 2 (Feb. 2005), 231–245.
[32]
Byron Reeves and Clifford Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media like Real People and Places. Cambridge University Press, New York, NY.
[33]
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1073–1083.
[34]
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 1577–1586.
[35]
Kihyuk Sohn, Xinchen Yan, and Honglak Lee. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS2015). MIT Press, Cambridge, MA, 3483–3491.
[36]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). MIT Press, Cambridge, MA, 3104–3112.
[37]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, 6000–6010.
[38]
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of the ICML Deep Learning Workshop. 37.
[39]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 3–4 (May 1992), 229–256.
[40]
Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, and Lei Xie. 2024. E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models. Retrieved from https://arxiv.org/abs/2401.00475
[41]
Qiang Zhang, Jason Naradowsky, and Yusuke Miyao. 2023. Ask an Expert: Leveraging Language Models to Improve Strategic Reasoning in Goal-Oriented Dialogue Models. Retrieved from https://arxiv.org/abs/2305.17878
[42]
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1. 654–664.
[43]
Guangyou Zhou, Yizhen Fang, Yehong Peng, and Jiaheng Lu. 2019. Neural conversation generation with auxiliary emotional supervised models. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 2, Article 19 (Sep. 2019), 17 pages.
[44]
Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence. 730–738. Retrieved from https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16455
[45]
Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1128–1137.
