1 Introduction
Recent years have seen a tremendous wave of interest in neural conversational models. With advances in end-to-end neural dialogue generation and natural language processing, conversational agents are now widely used in daily life, and this growing usage has raised considerable expectations for the research field. Consequently, open-domain dialogue systems built on end-to-end neural dialogue generation, which aim to produce human-like responses by learning correlations among the words in conversations, have become a central object of study.
Sequence-to-sequence (Seq2Seq) models are central to neural dialogue generation. Owing to the simplicity and flexibility with which such systems can be built, a considerable body of open-domain conversation generation research (Vinyals et al. [
38]; Li et al. [
21]) has advanced by training Seq2Seq models on large-scale data. Most work on dialogue systems has aimed to improve the quality and word diversity of the generated sentences by employing Seq2Seq models in different ways. These open-domain conversation generation systems do not encode conversation structure and are entirely data-driven: they are trained by
maximum-likelihood estimation (MLE) on a large training corpus. However, despite the success of neural conversational models in generating grammatically correct and human-like answers, the generated responses are often dull and generic rather than conveying feelings and emotions.
Notably, a number of studies have argued that emotion is an essential part of human communication and that displaying emotion in human-computer interfaces is key to enhancing users’ performance and increasing users’ satisfaction. For example, Picard [
29] provided an extensive discussion of the necessity of affect analysis not only in human communication and interaction but also in human creativity and decision-making. Lüdtke et al. [
18] argued that even when sentences are otherwise the same, using different emotional words can leave a person with entirely different feelings. Experiments in Reference [
30] showed that an empathetic computer agent can contribute to a more positive perception of interaction. The existing literature [
31] indicates that naturally addressing affective states in dialogue systems can improve user satisfaction. Within this context of the critical importance of emotions, being empathetic is an essential step toward human-like conversation. Furthermore, there have also been a few attempts (Zhou et al. [
44]; Huang et al. [
13]) to incorporate emotion into end-to-end neural dialogue generation, but each approach has its own shortcomings. For example, the model that Zhou et al. [
44] introduced focuses on supervised context-response pairs with specified emotion labels and requires a large annotated training corpus for effective training. Huang et al. [
13] presented models that emphasize emotions but neglect the semantic meaning and the diversity of word choices in the generated sentences.
The purpose of this research is to express desired emotions in neural dialogue generation and to help machines better understand human emotions. However, addressing emotional factors in a neural dialogue system poses several challenges. First, there is a lack of Chinese dialogue datasets labeled with emotions. This problem was addressed by Zhou et al. [
44], who trained a neural emotion classifier to predict the emotion expressed in the responses of the source post-response pairs and then used the predicted emotions as labels. Second, it is challenging to generate emotional sentences that are both natural and grammatically correct. To address these challenges, we utilized a Transformer-based encoder [
37] and the
gated recurrent units (GRU)-based encoder in the Seq2Seq model, and adopted the conditional variational autoencoder and the attention mechanism in our sequence generation model so that the dialogue system can express desired emotions naturally and explicitly in a conversation. Furthermore, we built on the idea of Kong et al. [
17], who proposed a conditional
generative adversarial net (GAN)-based sentiment-controlled dialogue generation model, and modified the discriminator network. Our framework has three subcomponents: the generator, the content discriminator, and the emotion classifier. The generator is responsible for producing emotional responses given the conversation history and a sentiment label; the adversarial content discriminator attempts to determine whether a generated response comes from the real data distribution, thereby enhancing the quality of the emotional responses; and the emotion classifier distinguishes whether the generated response matches the given emotion category. In summary, the main contributions of this article are the following:
(1)
We enhance the sequence-to-sequence model by designing a two-layer encoder that integrates the Transformer architecture with gated recurrent units (a minimal illustrative sketch is given after this list). This hybrid approach leverages the Transformer’s ability to capture long-range dependencies and contextual information while benefiting from the GRUs’ efficiency in handling sequential data. The combination enables the model to encode intricate input sequences more effectively, leading to more coherent and contextually relevant responses.
(2)
Recognizing the flexibility of the sequence-to-sequence framework, we incorporate a conditional variational autoencoder (CVAE) into our model. The CVAE uses latent variables to learn a distribution over potential responses, allowing the model to generate a wide range of diverse and contextually appropriate responses. This approach addresses the common issue of generating repetitive or generic responses, thereby enhancing the richness and variability of the generated dialogue.
(3)
Our model utilizes a conditional variational autoencoder-based sequence-to-sequence architecture as its generative core. To further refine the generative process, we introduce a content discriminator and an emotion classifier to assist during training. The content discriminator ensures that the generated responses maintain high relevance and coherence with the input content, while the emotion classifier promotes the accurate expression of specified emotions in the responses. This dual-assistance framework significantly boosts the model’s ability to produce emotionally nuanced and content-rich dialogue, resulting in a more engaging and human-like conversational experience.
(4)
We demonstrate that our model outperforms existing methods through comprehensive experiments on real-world datasets, and validate that the proposed method can generate semantically reasonable and emotionally appropriate responses.
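As a rough, non-authoritative illustration of the hybrid encoder described in contribution (1), the following PyTorch sketch stacks a Transformer encoder layer on top of the word embeddings and feeds its output to a GRU. The class name, layer sizes, and exact wiring are assumptions made for illustration only; the actual design is specified in Section 4.

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    """Illustrative two-layer encoder: a Transformer encoder layer followed by a GRU.
    Hyperparameters and wiring are assumptions, not the configuration of Section 4."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.gru = nn.GRU(d_model, hidden_size, batch_first=True)

    def forward(self, post_ids):
        # post_ids: (batch, seq_len) token indices of the dialogue history W^h.
        embedded = self.embedding(post_ids)
        # Self-attention captures long-range dependencies across the whole post.
        contextualized = self.transformer_layer(embedded)
        # The GRU reads the contextualized sequence and yields h_1..h_n;
        # its final hidden state serves as the context vector c.
        h, c = self.gru(contextualized)
        return h, c.squeeze(0)

# Example: encode a toy batch of two posts of length 12.
encoder = HybridEncoder(vocab_size=10000)
posts = torch.randint(0, 10000, (2, 12))
hidden_states, context = encoder(posts)  # shapes: (2, 12, 256) and (2, 256)
```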
The results of this study could be useful for practitioners who build conversational systems that generate emotional responses. The rest of this article is organized as follows. Section
1 provides a brief introduction to dialogue systems and sentiment analysis. Sections
2 and
3 review the related work about the theories and the background information that are relevant to our proposed method, respectively. Section
4 presents the design of our model. Sections
5 and
6 show our experimental results and implementation details, respectively. Finally, Section
7 summarizes the contributions of this research and concludes our work.
2 Related Work
The Seq2Seq framework described in Sutskever et al. [
36] learns to read an input sequence and then generates an output sequence. Vinyals et al. [
38] successfully presented a simple language model based on this Seq2Seq framework that is capable of extracting knowledge from a noisy but open-domain dataset. Since the introduction of neural dialogue models, many subsequent works based on this Seq2Seq framework have enhanced the quality of responses in different respects.
To make generated responses higher in quality, more diverse in wording, and indistinguishable from human-generated dialogue utterances, Li et al. [
23] introduced a reinforcement learning framework that encourages the generative model to sustain more extended discussions. Li et al. [
24] proposed an adversarial learning framework to generate dialogues that closely resemble human dialogues. Vaswani et al. [
37] proposed a practical encoder-decoder architecture called the Transformer, which relies entirely on attention and fully connected networks without using any convolutional or recurrent neural networks. To handle unknown words, See et al. [
33] and Gu et al. [
11] combined the pointer network framework with the Seq2Seq framework to copy sub-sequences from the input sequence and place them at the appropriate positions in the output sequence. To provide a conversational agent with personality, Li et al. [
22] presented a persona-based model that captures speaker consistency in neural response generation. Herzig et al. [
12] proposed a neural response generation model for customer service based on encoding target personality traits. Although these works have paid a great deal of attention to improving the quality of generated responses, such as grammatical correctness and human-likeness, the responses lack emotion. For instance, we cannot tell such a model to express a specific emotion in a response.
Emotions are essential for communication between people. Therefore, for a machine to communicate smoothly with humans, it needs to generate sentences with emotion. Picard [
29] provides a detailed discussion of the necessity of affect analysis not only in human communication and interaction but also in human creativity and decision-making. Furthermore, emotional factors play an important role in human-machine interaction: the capability to detect signs of human emotions and to react to users in a satisfactory way can improve the quality and naturalness of communication. Reeves and Nass’s [
32] research indicates that people relate to computers in the same way as they relate to other humans, and that some of these relationships are identical to real social relations. The existing literature [
31] presented by Prendinger, Mori, and Ishizuka indicates that addressing the user’s affective state in dialogue systems in a natural way can improve user satisfaction. Experiments in Prendinger and Ishizuka [
30] showed that an empathetic computer agent helps lower users’ stress levels (measured via skin conductance) shortly after an empathetic or apologetic response.
To address the emotion factor in a dialogue conversation system, Zhou et al. [
44] first introduced the emotion factor into large-scale conversation generation. They dealt with the problem by using external and internal memory networks, in which the external memory network decides whether to choose an emotional or a generic word, and the internal memory network measures how much of an emotion has already been expressed. Huang et al. [
13] presented three models that concatenate the embedded emotion to the input message either before or after it is passed to the encoder, or inject the embedded emotion into the decoder. Kong et al. [
17] proposed a conditional GAN-based sentiment-controlled dialogue generation model. The generator produces an emotional response based on the specified emotional label, and the discriminator distinguishes the generated responses from the reference responses by checking the dialogue history and the emotional label. Later, Zhou et al. [
43] introduced the nCG-ESM system to generate diverse emotional responses by enhancing the encoder-decoder framework with an emotional supervised mechanism and an emotional classifier for emotion distribution. A redundancy penalty was added to the objective function to prevent repetitive words, resulting in better quality responses. The results showed that the model effectively infused responses with different emotions. Peng et al. [
28] proposed TE-ECG, which is a model with topic and dynamic emotional attention modules. The topic module ensures response relevance by identifying and using topic words as prior knowledge. The dynamic emotional attention combines content with emotional factors during decoding to produce diverse, informative, and emotion-specific utterances. In Reference [
8], the proposed
latent emotion memory (LEM) network addresses multi-label emotion classification through a latent emotion module and a memory module. The latent emotion module learns emotion distributions using a variational autoencoder, while the memory module captures emotion-related features. These representations are combined and fed into a bi-directional gated recurrent unit for prediction. The model was trained in a supervised end-to-end manner, allowing the model to learn latent emotion distributions without external knowledge. In Reference [
4], the proposed
topic-enhanced capsule (TECap) network for multi-label emotion classification consists of a topic module and a capsule module. The topic module uses a
variational autoencoder (VAE) to learn latent topics from bag-of-words input. The capsule module captures emotion features through three deep capsule layers, using a latent topic attention-based routing algorithm to transfer semantic features. It then computes the probability for each emotion label independently. These components are trained jointly, enabling TECap to learn latent topic information without external knowledge. Fei et al. [
6] proposed a multiplex cascade framework for unified
aspect-based sentiment analysis (ABSA) using a cascade grid-tagging scheme within a multi-task architecture. This system enhances subtask interactions by reusing knowledge from lower-level tasks and incorporates external syntax knowledge, such as part-of-speech tags and dependency features, through a unified syntax graph convolution network. Evaluations on ABSA datasets show significant improvements over state-of-the-art models across seven subtasks. The model leverages existing training data without needing additional annotations. The approach demonstrates the effectiveness of multiplex decoding and syntax integration for ABSA. Li et al. [
20] enhanced multimodal emotion recognition in conversation (MM-ERC) by introducing a dual-level disentanglement mechanism that separates features into modality and utterance spaces using contrastive learning. For feature fusion, a
contribution-aware fusion mechanism (CFM) to manage multimodal contributions and a
context refusion mechanism (CRM) to coordinate dialogue contexts were proposed. The system achieves state-of-the-art performance on two public MM-ERC datasets, effectively utilizing multimodal and context features. These methods also have potential applications in other conversational multimodal tasks.
Fei et al. [
7] presented an innovative encoder–decoder framework for end-to-end
aspect-based sentiment triplet extraction (ASTE). Specifically, the ASTE task is first modeled as an unordered triplet set prediction problem, which is solved with a non-autoregressive decoding paradigm equipped with a pointer network. A novel high-order aggregation mechanism was proposed to fully integrate the underlying interactions between the overlapping structures of aspect and opinion terms. Next, a bipartite matching loss was introduced to facilitate the training of the non-autoregressive system. Experimental results on benchmark datasets show that the proposed framework significantly outperforms state-of-the-art methods. In Reference [
5], a multi-hop reasoning solution for implicit sentiment detection was introduced, significantly improving over traditional non-reasoning methods. It is the first successful application of the
Chain of Thought (CoT) idea to sentiment analysis. The method is simple, effective, and easily applicable to other similar NLP problems. Li et al. [
19] introduced the task of conversational aspect-based sentiment quadruple analysis (DiaASQ) for detecting target-aspect-opinion-sentiment quadruples in dialogues. A large-scale, high-quality DiaASQ dataset in Chinese and English was constructed, and a neural model was developed to benchmark the task, incorporating dialogue-specific and discourse features for improved quadruple extraction. The contributions include pioneering dialogue-level aspect-based sentiment analysis, releasing a bilingual dataset, and presenting an effective end-to-end model for the task. In Reference [
41], Zhang et al. proposed an approach that adopts prompt-based conversations with a
large language model (LLM) as a knowledge base to enhance dialogue systems by providing step-by-step analyses similar to those of a human counselor. This expert advice is incorporated during both training and inference, improving the model’s ability to integrate useful strategies. Evaluations show that this approach makes models more suggestive, helpful, and engaging than those not using expert consultation, highlighting the benefit of combining dialogue models with LLM-based expert consultation. Xue et al. [
40] introduced the
emotional chat model (E-chat), designed to respond based on detected speech emotions. To address the lack of emotional spoken dialogue datasets, the E-chat200 dataset was developed. E-chat extracts emotion embeddings using a speech encoder, combines them with the LLM decoder inputs for joint training, and responds according to various emotions. Evaluations show that E-chat significantly outperforms baseline LLMs on objective metrics and achieves higher
mean opinion scores (MOS) in subjective evaluations, demonstrating its ability to deliver emotionally nuanced responses.
In this article, we build on the idea of Kong et al. [
17] by modifying the discriminator network and by using a Transformer-based encoder and an emotion classifier to enhance dialogue response generation.
3 Background
In this work, our goal is to train a dialogue system that is able to generate appropriate responses given a post and an emotional label. In this section, we first introduce the GRU-based sequence-to-sequence model and the Transformer-based sequence-to-sequence model. Next, we briefly describe the framework of conditional variational autoencoders. Finally, we discuss the existing method of generative adversarial nets.
3.1 GRU-based Sequence-to-Sequence Model
In this section, we mainly introduce a Seq2Seq model with GRUs [
3]. A sequence-to-sequence model (i.e., an encoder-decoder network, Seq2Seq) consists of two recurrent neural networks called the encoder and the decoder [
36]. The encoder reads the dialogue history
\(W^{h}=(w^{h}_{1}, w^{h}_{2}, \ldots , w^{h}_{n})\), converts it into hidden representations
\(h=(h_{1}, h_{2}, \ldots , h_{n})\), and outputs a single vector, called the context vector
\(c\), which is obtained by taking the last hidden state of
\(h\) according to Equations (1) and (2):
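Assuming a standard GRU encoder, with \(e(\cdot)\) denoting the word-embedding lookup (a notational assumption), these can be written as
\[
h_{t} = \mathrm{GRU}\big(h_{t-1},\, e(w^{h}_{t})\big), \quad t = 1, \ldots , n, \tag{1}
\]
\[
c = h_{n}. \tag{2}
\]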
When the decoder reads the context vector
c, it produces an output sequence
\(W^{r}=(w^{r}_{1}, w^{r}_{2}, \ldots , w^{r}_{m})\). More precisely, at every time step of decoding, the decoder is given its previous hidden state and the embedding of the previous target word
\(e(w^{r}_{t-1})\) to update its hidden state
\(s_{t}\), which is calculated by Equation (
3):
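Assuming the decoder state is initialized with the context vector (\(s_{0} = c\), an assumption of this sketch), a standard form of this update is
\[
s_{t} = \mathrm{GRU}\big(s_{t-1},\, e(w^{r}_{t-1})\big). \tag{3}
\]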
After obtaining the hidden state
\(s_{t}\), the decoder predicts, through a softmax function, a probability distribution over all characters in the vocabulary at the
tth time step. The formulation is given by Equation (
4), where
\(V_{o}\) is a trainable matrix:
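With \(V_{o}\) mapping the decoder state to vocabulary logits, the prediction takes the standard form
\[
p\big(w^{r}_{t} \mid w^{r}_{<t}, W^{h}\big) = \mathrm{softmax}\big(V_{o}\, s_{t}\big). \tag{4}
\]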
3.2 Attention Mechanism
The Attention mechanism [
1,
26] is essential for a sequence model to focus on a specific range of the input sentence. It should be noted that we mainly introduce the global attention with the general score function described in Reference [
26]. When only the context vector
c is passed between the encoder and the decoder, the basic encoder must compress the entire input sentence into a single fixed-length vector, which becomes an information bottleneck. To handle this problem, the attention mechanism utilizes a dynamically changing context
\(\widetilde{h}_{t}\) in the decoding process. Mathematically, the attention mechanism is used to calculate a set of attention weights
\(a_{t}\) over all hidden states of the encoder
\(h=(h_{1}, h_{2}, \ldots , h_{n})\) and decoder hidden state
\(s_{t}\) at time
t as formulated in Equation (
5). In this way, at every time step of decoding, the attention mechanism allows the decoder to focus on different parts of the encoder outputs:
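With the general score function of Reference [26], and writing \(a_{t}(i)\) for the weight assigned to encoder state \(h_{i}\) (a notational assumption), the weights are computed as
\[
a_{t}(i) = \frac{\exp\big(s_{t}^{\top} W_{a}\, h_{i}\big)}{\sum_{i^{\prime}=1}^{n} \exp\big(s_{t}^{\top} W_{a}\, h_{i^{\prime}}\big)}, \quad i = 1, \ldots , n, \tag{5}
\]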
where
\(W_{a}\) is a trainable matrix in the attention layer that measures the similarity between the decoder state and each encoder hidden state. Moreover, each attention weight
\(a_{t}\) is multiplied by the source hidden state
h of the encoder to create a new weighted combination context vector
\(\widetilde{h}_{t}\) as formulated in Equation (
6):
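In the standard weighted-sum form, this is
\[
\widetilde{h}_{t} = \sum_{i=1}^{n} a_{t}(i)\, h_{i}. \tag{6}
\]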
Afterward, the concatenation of the vector
\(\widetilde{h}_{t}\) and the decoder output
\(s_{t}\) at time
t is fed into a linear layer to predict the next word according to Equation (
7). In this way, the result
\(\widetilde{h}_{t}\) includes the information about a particular section of the input sequence, and thus facilitates the decoder’s prediction of the correct output words:
where
\(W_{\widetilde{h}}\) is a trainable matrix, and
\({s}^{\prime }_{t}\) is used to predict the next word.
3.3 Transformer-based Sequence-to-Sequence Model
The Transformer follows the overall architecture of a sequence-to-sequence model, using stacked self-attention and point-wise, feed-forward layers for both the encoder and the decoder. The Transformer relies entirely on attention mechanisms and is able to capture long-distance dependencies well. Moreover, the Transformer has proved superior in quality on many sequence-to-sequence tasks, such as machine translation, reading comprehension, summarization, and language understanding.
In more detail, each encoder layer consists of two sub-layers. The first is a multi-head self-attention layer, and the second is a simple feed-forward neural network. The self-attention layer helps the encoder incorporate information from the other words in the input sentence, and its outputs are then processed by the feed-forward neural network. Each decoder layer has both of these sub-layers, with an encoder-decoder attention layer between them that helps the decoder focus on the corresponding parts of the input sentence. The multi-head self-attention mechanism consists of several scaled dot-product attention layers running in parallel. Given a matrix of
n query vectors
\(Q\in R^{n\times d}\), key vectors
\(K\in R^{n\times d}\), and value vectors
\(V\in R^{n\times d}\), the scaled dot-product attention computes the attention scores by Equation (
8):
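This is the scaled dot-product attention of Reference [37]:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \tag{8}
\]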
where
d is the number of hidden units. The multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. The formulation is represented by Equations (
9) and (
10):
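Following Reference [37], the heads are computed and concatenated as
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots , \mathrm{head}_{h})\, W^{O}, \tag{9}
\]
\[
\mathrm{head}_{i} = \mathrm{Attention}\big(Q W_{i}^{Q},\, K W_{i}^{K},\, V W_{i}^{V}\big), \tag{10}
\]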
where
h is the number of parallel attention layers, or heads. The projections are parameter matrices
\(W_{i}^{Q}\in R^{d\times d/h}\),
\(W_{i}^{K}\in R^{d\times d/h}\),
\(W_{i}^{V}\in R^{d\times d/h}\), and
\(W^{O}\in R^{d\times d}\).
3.4 Conditional Variational Autoencoder
The CVAE [
35] is one of the most representative recent deep generative models. The goal of the CVAE is first to compress the input
x into a smaller probability distribution over a latent variable
z, and then to transform the distribution
z back into an approximation of the original input
x, conditioned on an extra attribute. Because the structure of the CVAE is similar to an encoder-decoder, a sequence-to-sequence model can be easily extended to a CVAE. Moreover, the CVAE has shown great success in generating diverse and appropriate dialogue responses [
42]. In the field of dialogue generation, the target response
\(W^{r}\) is described as
x, and the extra attribute is composed of the context vector
c of the source texts
\(W^{h}\). Mathematically, the CVAE is trained by maximizing a lower bound on the conditional likelihood of
\(W^{r}\) given
\(W^{h}\), according to Equation (
11):
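One formulation consistent with this description writes the conditional likelihood as
\[
p\big(W^{r} \mid W^{h}\big) = p\big(W^{r} \mid c\big) = \int p\big(W^{r} \mid z, c\big)\, p\big(z \mid c\big)\, dz. \tag{11}
\]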
z is the latent variable that is sampled from a prior network
\(p(z|c)\), and then the response decoder
\(p(W^{r}|z,c)\) reconstructs
\(W^{r}\) based on the samples
z and context attribute
c. A recognition network
\(q(z|W^{r},c)\) is introduced to approximate the true posterior distribution
\(p(z|W^{r},c)\) by taking the target response
\(W^{r}\) and context
c, and will be absent in the testing stage. The variational lower bound can be written as Equation (
12):
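In its standard form, the bound combines a reconstruction term with a KL regularizer:
\[
\mathcal{L} = \mathbb{E}_{q(z \mid W^{r}, c)}\big[\log p\big(W^{r} \mid z, c\big)\big] - \mathrm{KL}\big(q\big(z \mid W^{r}, c\big)\,\big\|\, p\big(z \mid c\big)\big) \le \log p\big(W^{r} \mid c\big). \tag{12}
\]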
In the training stage, the latent variable sample z drawn from the recognition network is passed to the response decoder, and the KL term encourages the recognition distribution to stay close to the prior, so that z approximates the latent variable \({z}^{\prime }\) drawn from the prior network. In the testing stage, to generate responses without knowledge of the target responses, the latent variable \({z}^{\prime }\) drawn from the prior network is passed to the response decoder.
3.5 Generative Adversarial Nets
GANs, which were proposed by Goodfellow et al. [
9], are an exciting recent innovation in machine learning. The term “generative” in the name generative adversarial nets describes a class of statistical models that contrasts with discriminative models. The original GAN framework contains a generative model
G, which is in charge of learning to generate plausible data; the generated instances serve as negative training examples for the discriminative model
D, which learns to distinguish the generated data from real data and penalizes the generative model for producing implausible results. During training, the generative model of a GAN learns to create fake data by incorporating feedback from the discriminative model, aiming to fool
D into classifying its output as real; the discriminative model of a GAN learns to correctly classify real items coming from the training data set and generated items coming from
G.
The existing research proposed by Li et al. [
24] demonstrates that the adversarially trained system generates highly realistic responses and improves the quality of text generation. This adversarial dialogue generation model, in which the generative model
G takes the form of a standard sequence-to-sequence model, defines a policy that generates a response
\(W^{r}\) given the dialogue history
\(W^{h}\). The discriminative model
D is a binary classifier that takes the dialogue history
\(W^{h}\) and a dialogue response
\(W^{r}\) as an input and outputs a label
\(D(W^{h}, W^{r})\) indicating whether the dialogue response
\(W^{r}\) is generated from machines or human beings. For more details, its goal is to maximize the expected reward of generated response
\(D(W^{h}, W^{r})\) using the following algorithm Equation (
13) according to the reinforcement learning by Williams et al. [
39]:
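Writing the generator’s policy as \(G_{\theta}(W^{r} \mid W^{h})\) (a notational assumption here), the expected-reward objective is
\[
J(\theta) = \mathbb{E}_{W^{r} \sim G_{\theta}(\cdot \mid W^{h})}\big[ D\big(W^{h}, W^{r}\big) \big]. \tag{13}
\]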
Nevertheless, one primary challenge in the GANs framework is that there is no control over the modes of data to be generated. As a result, Mirza and Osindero [
27] proposed the
conditional generative adversarial nets (CGANs), which extend the original GAN framework by adding an additional parameter as extra information to the generator, in the hope that the corresponding item is generated according to this extra information. Kong et al. [
17] have successfully applied CGANs to the sentiment-controlled dialogue generation model. Similar to the training algorithm of GANs, the objective of the generative model in CGANs is to maximize the expected reward of generated responses. It can be written as Equation (
14):
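Analogously, with the emotion label as the conditioning variable and \(G_{\theta}\) again denoting the generator’s policy (a notational assumption), the objective becomes
\[
J(\theta) = \mathbb{E}_{W^{r} \sim G_{\theta}(\cdot \mid W^{h}, \mathit{label})}\big[ D\big(W^{h}, W^{r}, \mathit{label}\big) \big]. \tag{14}
\]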
Among them, label is an additional parameter that controls the output and guides the generative model. \(D(W^{h}, W^{r}, label)\) is interpreted as the probability that the response \(W^{r}\) was produced by a human rather than a machine, given \(W^{h}\) and the corresponding label. It should be noted that the discriminative model evaluates not only the similarity between the generated data and the original data but also the correspondence of the generated data to its input label. In this way, the extra information carried by the parameter label is incorporated both into the data to be generated and into the discriminative model’s input, so generation is no longer completely unconditioned. Thus, CGANs help the generator produce sentiment-controlled dialogue responses.