An improved CycleGAN-based emotional voice conversion model by augmenting temporal dependency with a transformer
Introduction
In recent years, demand for affective expression in artificial intelligence systems has rapidly increased, spanning motion, speech, and facial expressions (Sheldon, 2001, Pelachaud, 2009, Chella et al., 2008). Emotional voice conversion (EVC) is one of the important topics in this research field. Nevertheless, because speech is a complex signal that carries rich information, EVC performance still has considerable room for improvement, and deep learning methods offer a promising way forward.
Generally, emotional voice conversion is a special type of voice conversion (VC) that aims to transform the emotional features of an utterance into those of a target emotion while retaining the semantic information and speaker identity. Earlier research in this field focused on mapping the prosody and spectrogram with partial least squares regression (Helander et al., 2010), Gaussian mixture models (GMMs) (Aihara et al., 2012, Kawanami et al., 2003), and sparse representation methods (Ming et al., 2016, Takashima et al., 2013). More recently, researchers have leveraged deep learning methods to improve the performance of EVC, such as deep neural networks (DNNs) (Vekkot et al., 2020, Luo et al., 2016), sequence-to-sequence (seq2seq) models with long short-term memory (LSTM) networks (Robinson et al., 2019), and convolutional neural networks (CNNs) (Kameoka et al., 2020), as well as their combinations with the attention mechanism (Choi and Hahn, 2021). However, these models must be trained on parallel data; that is, the source and target utterances should come from the same speaker and carry identical linguistic information but different emotions.
To reduce models’ reliance on parallel training data, several novel frameworks have been introduced into this field. Gao et al. (2018) proposed a nonparallel data-driven emotional VC method based on an auto-encoder. Ding and Gutierrez-Osuna (2019) adopted vector-quantized variational autoencoders (VQ-VAE) with group latent embedding (GLE) for nonparallel training. Kameoka et al. (2019) proposed ACVAE, which augments the variational autoencoder with an auxiliary classifier. Moreover, to better learn the mapping between non-parallel data distributions, the cycle-consistent adversarial network (CycleGAN) (Zhou et al., 2020a, Liu et al., 2020) and the variational autoencoder-generative adversarial network (VAE-GAN) (Cao et al., 2020) were introduced into the EVC task. Furthermore, Moritani et al. (2021) employed StarGAN to realize non-parallel spectral envelope transformation. All of these CNN-based EVC models trained on non-parallel data achieved reasonable performance.
Despite the progress made in non-parallel training, the quality of converted emotional voice still leaves room for improvement. Because speech is a time series with rich acoustic features, there are interactive temporal relationships among frames. Although CNNs are well known for their ability to handle temporal data, speech is a lengthy temporal sequence, so CNN-based models must be stacked very deep to widen their temporal dependency (see Fig. 1). In this manner, however, temporal intra-relations can be diluted layer by layer, causing the model to suffer from instability problems (e.g., mispronunciations and skipped phonemes).
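The growth of a CNN stack's temporal coverage can be made concrete with a small receptive-field calculation. The sketch below is illustrative only; the kernel sizes and strides are hypothetical, not the configuration of any model in this paper.

```python
# Sketch: receptive field (in frames) of a stack of 1-D convolutions.
# Shows why CNN-based models must be stacked very deep before a single
# output frame can "see" a long stretch of speech.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * jump
        jump *= s              # stride multiplies the spacing between frames
    return rf

# Five stride-1 layers with kernel 5 cover only 21 frames of context;
# a transformer layer, by contrast, attends over the whole sequence at once.
print(receptive_field([(5, 1)] * 5))
```

With stride-1 convolutions the receptive field grows only linearly in depth, which is the dilution problem the paragraph above describes.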
To enhance a model’s ability to capture contextual information and intra-relations among frames, transformers have been widely discussed in the fields of computer vision (Dosovitskiy et al., 2020) and natural language processing (Wu et al., 2020a), and their attention distance has been explored. However, few studies have investigated the capabilities of transformers for speech generation or conversion tasks. To address these problems, this study makes the following contributions:
- •
We proposed CycleTransGAN, a CycleGAN-based model equipped with a transformer, and investigated its ability on the EVC task.
- •
To enhance the model’s ability to convert emotional voices, we adopted curriculum learning to gradually increase the frame length during training. Furthermore, a fine-grained-level discriminator was designed to judge how close each segment is to the real samples.
- •
The proposed method was evaluated on the Japanese emotional speech dataset and the Emotional Speech Dataset (ESD). The model was compared with widely used baselines (i.e., ACVAE (Kameoka et al., 2019) and CycleGAN (Zhou et al., 2020a)) as well as with different configurations of our proposed model.
- •
We discussed the proposed model’s temporal dependency augmented by the transformer.
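The curriculum-learning contribution above can be sketched as a simple frame-length schedule. The start length, end length, step size, and epoch interval below are hypothetical placeholders, not the values used in the paper.

```python
# Sketch of a curriculum schedule that gradually lengthens the training
# segments, so the model first learns on short clips and later on long ones.
# All numeric values are illustrative assumptions.

def frame_length_schedule(epoch, start=64, end=256, step=32, every=10):
    """Grow the segment length by `step` frames every `every` epochs,
    capped at `end`."""
    return min(end, start + (epoch // every) * step)

for epoch in (0, 10, 50, 100):
    print(epoch, frame_length_schedule(epoch))
```

During training, each mini-batch would be cropped (or padded) to the length returned for the current epoch, so the transformer's attention span widens as training progresses.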
The remainder of this paper is structured as follows. Related works are introduced in Section 2. The proposed method is described in detail in Section 3. The experiment and results are described in Section 4. Discussions are presented in Section 5, and the final Section 6 briefly summarizes our work.
Transformer
The transformer was first proposed by Vaswani et al. (2017); it adopts the attention mechanism to weigh the significance of each portion of the input data. Like recurrent neural networks (RNNs), the transformer was designed to process sequential data, for tasks such as translation (Conneau and Lample, 2019), language understanding (Wang et al., 2018), and text classification (Devlin et al., 2018, Yang et al., 2019) in natural language processing. However, unlike RNNs, there is no need to process the data in sequential order, which permits far greater parallelization.
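The attention mechanism described above can be illustrated with a minimal scaled dot-product self-attention step; every output frame is a weighted mixture of all input frames, regardless of distance. The shapes and random projections below are illustrative, not the model's actual dimensions.

```python
# Minimal scaled dot-product self-attention (after Vaswani et al., 2017):
# each frame attends to every other frame in one step, rather than through
# a deep stack of local convolutions.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (T, d) sequence of T frames; w_*: (d, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ v                              # context-mixed frames

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                     # 5 frames, 8 features each
w = [rng.standard_normal((8, 4)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (5, 4): every output frame mixes all 5 input frames
```

Because the (T, T) score matrix couples all frame pairs directly, the attention distance spans the whole sequence in a single layer.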
Novel designs
In this study, we introduce three novel designs to achieve better performance from the model; the details are presented below.
Dataset
In this study, we used the Japanese emotional speech dataset (Asai et al., 2020) and the Emotional Speech Dataset (ESD; Zhou et al., 2022), both of which contain happy, angry, sad, and neutral utterances.
Japanese emotional speech dataset: each emotion category of this dataset contains 1070 utterances in total. We used 1000 utterances in the training phase, of which 50 were assigned to the validation set, and the remaining 70 utterances formed the testing set. The duration of each emotion is presented in Table 1.
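The split described above can be sanity-checked with a short calculation; the counts come directly from the text.

```python
# Sanity check of the Japanese-corpus split per emotion category:
# 1070 utterances in total, 1000 used in the training phase (50 of which
# form the validation set), and the remaining 70 reserved for testing.
total, train_phase, val, test = 1070, 1000, 50, 70

train = train_phase - val           # utterances actually used to fit the model
assert train + val + test == total  # the three sets exactly cover the corpus
print(train, val, test)             # 950 50 70
```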
Discussion
Conclusion
In this paper, we proposed CycleTransGAN, a CycleGAN-based emotional VC model with a transformer module. The model was evaluated on two datasets: the Japanese emotional speech dataset and the Emotional Speech Dataset (ESD). Taking advantage of the transformer, the model can take contextual information into account over a wider range. This allows the generated speech to be more consistent in terms of temporal features, thereby improving the quality and naturalness of the converted speech.
CRediT authorship contribution statement
Changzeng Fu: Conceptualization, Methodology, Implementation, Evaluation. Chaoran Liu: Conceptualization, Methodology. Carlos Toshinori Ishi: Conceptualization. Hiroshi Ishiguro: Conceptualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Grant-in-Aid for Scientific Research on Innovative Areas JP20H05576 (model training) and by JST, Moonshot R&D under Grant JPMJMS2011 (model evaluation).
References (48)
Morise, M., 2015. CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Commun.
Zhou et al., 2022. Emotional voice conversion: Theory, databases and ESD. Speech Commun.
Aihara et al., 2012. GMM-based emotional voice conversion using spectrum and prosody features. Am. J. Signal Process.
Asai, S., Yoshino, K., Shinagawa, S., Sakti, S., Nakamura, S., 2020. Emotional speech corpus for persuasive dialogue...
Cao et al., 2020. Nonparallel emotional speech conversion using VAE-GAN.
Chella et al., 2008. An emotional storyteller robot.
Chen et al., 2021. When vision transformers outperform ResNets without pretraining or strong data augmentations.
Choi, H., Hahn, M., 2021. Sequence-to-sequence emotional voice conversion with strength control. IEEE Access.
Conneau, A., Lample, G., 2019. Cross-lingual language model pretraining. Adv. Neural Inf. Process. Syst.
Devlin et al., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.