The VAE [7,15] is a directed graphical model with continuous latent variables, and it is widely used in image and natural language generation tasks. Unlike a traditional autoencoder, the VAE encodes an input $x$ into a probability distribution and then reconstructs the original input with a decoder network by sampling a continuous latent variable $z$ from this distribution, as illustrated in Figure 2. A formal description of the problem is as follows. Let $x$ be an observation of a random variable taking values in $\mathcal{X}$. We assume that the generation of $x$ involves a continuous latent variable $z$, taking values in $\mathcal{Z}$, by means of a joint density $p_{\theta}(x, z) = p_{\theta}(x \mid z)\, p(z)$, parametrized by $\theta$. Given a set of observed data points $\{ x^{(i)} \}_{i=1}^{N}$, the goal of maximum likelihood estimation is to estimate the parameters $\theta$ that maximize the marginal log-likelihood $\log p_{\theta}(x)$:
$$\log p_{\theta}(x) = \log \int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, \mathrm{d}z .$$
Due to the integration over the latent variables, it is intractable to directly compute or differentiate the marginal log-likelihood. A common approach is to maximize a variational lower bound on the marginal log-likelihood by introducing an approximate posterior $q_{\phi}(z \mid x)$:
$$\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right] - \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right),$$
where KL denotes the Kullback–Leibler divergence. The evidence lower bound can also be rewritten as a minimum description length loss function:
$$\mathcal{L}(\theta, \phi; x) = \mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right) - \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right],$$
where the neural network with parameters φ, called the “recognition” model, is introduced to approximate the true posterior $p_{\theta}(z \mid x)$. Another neural network with parameters θ, represented as $p_{\theta}(x \mid z)$, aims to reconstruct the data. In general, we assume that $q_{\phi}(z \mid x)$ is a multivariate diagonal Gaussian distribution:
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\, \mu_{\phi}(x),\, \mathrm{diag}\!\left(\sigma_{\phi}^{2}(x)\right)\right),$$
where $\mu_{\phi}(x)$ and $\sigma_{\phi}^{2}(x)$ are outputs of the recognition network.
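With the standard normal prior $p(z) = \mathcal{N}(0, I)$ commonly adopted in the VAE (an assumption here, since the prior is not specified above), the KL term in the loss above has a well-known closed form for this diagonal Gaussian posterior, stated for completeness; $\mu_{j}$ and $\sigma_{j}^{2}$ denote the $j$-th components of $\mu_{\phi}(x)$ and $\sigma_{\phi}^{2}(x)$, and $d_z$ is the dimensionality of $z$:
$$\mathrm{KL}\!\left(q_{\phi}(z \mid x) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d_z} \left( \mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1 \right).$$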
For particularly simple parametric forms of $q_{\phi}(z \mid x)$, one can backpropagate through the sampling process $z \sim q_{\phi}(z \mid x)$ by applying the reparametrization trick, which first samples $\epsilon \sim \mathcal{N}(0, I)$ and then computes $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. As a result, the VAE can be trained efficiently using stochastic gradient descent, which is essential for VAE training.
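To make the reparametrized objective concrete, a minimal sketch in PyTorch is given below; the feed-forward encoder and decoder, the layer sizes, and the Bernoulli reconstruction likelihood are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: a recognition network q_phi(z|x) parametrized as a
    diagonal Gaussian, and a decoder p_theta(x|z); sizes are illustrative."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log sigma^2 of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: sample eps ~ N(0, I), then z = mu + sigma * eps,
        # so gradients can flow through the sampling step.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_logits = self.dec(z)
        # Reconstruction term -E_q[log p_theta(x|z)] (Bernoulli likelihood assumed)
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return rec + kl   # negative ELBO, minimized by stochastic gradient descent
```

A training loop would simply minimize the returned loss with a stochastic optimizer; when the data are sentences, the encoder and decoder are typically recurrent networks rather than the feed-forward layers assumed here.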
The CVAE is a modification of the VAE that conditions generation on certain attributes, e.g., generating different human faces given skin color, gender, age, and so on [42], or generating different sentences given sentiment, topic, and so on. The formula is as follows, where $c$ denotes the condition:
$$\mathcal{L}(\theta, \phi; x, c) = \mathrm{KL}\!\left(q_{\phi}(z \mid x, c) \,\|\, p_{\theta}(z \mid c)\right) - \mathbb{E}_{q_{\phi}(z \mid x, c)}\!\left[\log p_{\theta}(x \mid z, c)\right],$$
where the prior over the latent variable, $p_{\theta}(z \mid c)$, is now conditioned on $c$ as well.
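A minimal sketch of one way to implement this conditioning is given below, again in PyTorch; concatenating a condition vector $c$ to the inputs of the recognition network, the prior network, and the decoder is an illustrative assumption (any conditioning scheme with the same factorization would do), and the feed-forward layers and sizes are likewise placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal CVAE sketch: q_phi(z|x,c), a learned prior p_theta(z|c), and a
    decoder p_theta(x|z,c); conditioning by concatenation is an assumption."""
    def __init__(self, x_dim=784, c_dim=10, z_dim=32, h_dim=256):
        super().__init__()
        self.post = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.Tanh(),
                                  nn.Linear(h_dim, 2 * z_dim))   # mu, logvar of q(z|x,c)
        self.prior = nn.Sequential(nn.Linear(c_dim, h_dim), nn.Tanh(),
                                   nn.Linear(h_dim, 2 * z_dim))  # mu, logvar of p(z|c)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        mu_q, logvar_q = self.post(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(c).chunk(2, dim=-1)
        # Reparametrized sample from the recognition network q_phi(z|x,c)
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        x_logits = self.dec(torch.cat([z, c], dim=-1))
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # KL(q_phi(z|x,c) || p_theta(z|c)) between two diagonal Gaussians
        kl = 0.5 * torch.sum(logvar_p - logvar_q
                             + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                             - 1.0)
        return rec + kl   # negative conditional ELBO
```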
Variational encoder-decoders have shown promising results in text generation [4,34,43]. Straightforwardly optimizing Equation (4), however, results in the KL-vanishing problem, where the Recurrent Neural Network (RNN) part ends up explaining all of the structure without making use of the latent representation. Much meaningful work has been done to alleviate this problem [11,14,45]; one widely used remedy, KL cost annealing, is sketched below.
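KL cost annealing gradually increases the weight on the KL term from 0 to 1 during training so that the decoder cannot simply ignore the latent variable early on. The linear schedule and the `warmup_steps` value below are illustrative choices, not necessarily those used in the cited works:

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: the weight rises from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

# Inside a training loop, the annealed objective would then be, e.g.:
#     loss = reconstruction_loss + kl_weight(step) * kl_loss
```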
When dealing with text generation, the CVAE model can generate more diverse sentences than the Seq2Seq model. However, in the emotional dialogue generation task, the general CVAE model is not powerful enough to stay consistent with the corresponding emotion. Our proposed method employs the CVAE as the baseline to accommodate fine-grained emotions.