Speech Communication

Volume 144, October 2022, Pages 110-121
An improved CycleGAN-based emotional voice conversion model by augmenting temporal dependency with a transformer

https://doi.org/10.1016/j.specom.2022.09.002

Highlights

  • A CycleTransGAN model is proposed to improve performance on the emotional voice conversion (EVC) task.

  • Curriculum learning was adopted to gradually increase the input length during training.

  • A fine-grained level discriminator was designed to enhance the model’s ability to convert emotional voices.

  • The proposed method was evaluated on a Japanese emotional speech dataset and Emotional Speech Dataset (ESD, containing English and Chinese speech).

  • The transformer augmented the model’s temporal dependency over a wider range, which improved the quality of the converted speech.

Abstract

Emotional voice conversion (EVC) is a task that converts an utterance’s emotional features into those of a target emotion while retaining semantic information and speaker identity. Recently, researchers have leveraged deep learning methods to improve the performance of EVC, such as deep neural networks (DNN), sequence-to-sequence models (seq2seq), long short-term memory networks (LSTM), and convolutional neural networks (CNN), as well as their combinations with an attention mechanism. However, these methods often suffer from instability problems (e.g., mispronunciations and skipped phonemes) because the models fail to capture temporal intra-relationships among a wide range of frames, resulting in unnatural speech and discontinuous emotional expression. To enhance the ability to capture intra-relations among frames by augmenting the temporal dependency of the model, we explored the power of a transformer in this study. Specifically, we proposed a CycleGAN-based model with a transformer and investigated its ability in the EVC task. In the training procedure, we adopted curriculum learning to gradually increase the frame length so that the model can progress from short segments to the entire utterance. The proposed method was evaluated on a Japanese emotional speech dataset and the Emotional Speech Dataset (ESD, which contains English and Chinese speech), and compared with widely used EVC baselines (ACVAE, CycleGAN) using objective and subjective evaluations. The results indicate that our proposed model converts emotion with higher emotional similarity, quality, and naturalness.

Introduction

In recent years, the demand for affective expression in artificial intelligence systems, including motion, speech, and facial expressions, has rapidly increased (Sheldon, 2001, Pelachaud, 2009, Chella et al., 2008). Emotional voice conversion (EVC) is one of the important topics in this research field. Nevertheless, because speech is a complex signal that carries rich information, the performance of EVC can still be further improved with deep learning methods.

Generally, emotional voice conversion is a special type of voice conversion (VC) that aims to transform an utterance’s emotional features into those of a target emotion while retaining semantic information and speaker identity. Earlier research in this field focused on mapping prosody and spectral features with partial least squares regression (Helander et al., 2010), Gaussian mixture models (GMM) (Aihara et al., 2012, Kawanami et al., 2003), and sparse representation methods (Ming et al., 2016, Takashima et al., 2013). More recently, researchers have leveraged deep learning methods to improve the performance of EVC, such as deep neural networks (DNN) (Vekkot et al., 2020, Luo et al., 2016), sequence-to-sequence models (seq2seq) with long short-term memory networks (LSTM) (Robinson et al., 2019), and convolutional neural networks (CNN) (Kameoka et al., 2020), as well as their combinations with the attention mechanism (Choi and Hahn, 2021). However, these models must be trained on parallel data; that is, the source and target utterances should come from the same speaker and have identical linguistic content but different emotions.

To reduce the models’ reliance on parallel training data, several novel frameworks have been introduced into this field. Gao et al. (2018) proposed a nonparallel data-driven emotional VC method based on an auto-encoder. Recently, Ding and Gutierrez-Osuna (2019) adopted vector quantized variational autoencoders (VQ-VAE) with group latent embedding (GLE) for nonparallel training. ACVAE (Kameoka et al., 2019) augments a variational autoencoder with an auxiliary classifier. Moreover, to better learn the mapping between non-parallel data distributions, the cycle-consistent adversarial network (CycleGAN) (Zhou et al., 2020a, Liu et al., 2020) and the variational autoencoder-generative adversarial network (VAE-GAN) (Cao et al., 2020) were introduced into the EVC task. Furthermore, Moritani et al. (2021) employed StarGAN to realize non-parallel spectral envelope transformation. These EVC models, built on CNN-based layers and trained on non-parallel data, all achieved reasonable performance.
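
To make the non-parallel training idea concrete, the following minimal PyTorch sketch illustrates a cycle-consistency objective of the kind CycleGAN-based EVC models optimize. The generator names and the L1 reconstruction term are illustrative assumptions, not the exact formulation of the cited works.

```python
import torch.nn as nn

# Illustrative sketch of the cycle-consistency term used by CycleGAN-style
# conversion models. G_s2t and G_t2s are hypothetical generators mapping
# acoustic features between a source emotion (e.g., neutral) and a target
# emotion (e.g., happy).
l1 = nn.L1Loss()

def cycle_consistency_loss(G_s2t, G_t2s, x_src, x_tgt):
    """Encourage G_t2s(G_s2t(x_src)) ~ x_src and G_s2t(G_t2s(x_tgt)) ~ x_tgt,
    so the two mappings can be learned without parallel utterance pairs."""
    recon_src = G_t2s(G_s2t(x_src))   # source -> target -> source
    recon_tgt = G_s2t(G_t2s(x_tgt))   # target -> source -> target
    return l1(recon_src, x_src) + l1(recon_tgt, x_tgt)
```

In practice this term is combined with adversarial losses from domain discriminators; only the cycle term is shown here.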

Despite the progress made in non-parallel training, the quality of the converted emotional voice still leaves room for improvement. Because speech is a time series with rich acoustic features, there are interactive temporal relationships among frames. Although CNNs are well known for their ability to handle temporal data, speech is a long temporal sequence, so CNN-based models must be stacked very deep to widen their temporal dependency (see Fig. 1). With this approach, however, temporal intra-relations can be diluted at each layer, causing the model to suffer from instability problems (e.g., mispronunciations and skipped phonemes).
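
The linear growth of a convolutional stack’s receptive field can be illustrated with a short calculation. The kernel size below is an assumed value for illustration, not the configuration used in this paper.

```python
def conv_stack_receptive_field(num_layers, kernel_size=5, stride=1, dilation=1):
    """Receptive field (in frames) of a stack of identical 1-D convolution layers."""
    rf = 1
    jump = 1  # spacing between adjacent output positions, measured in input frames
    for _ in range(num_layers):
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

# With stride 1, the receptive field grows only linearly with depth:
print([conv_stack_receptive_field(n) for n in (1, 5, 10, 50)])  # [5, 21, 41, 201]
# Covering ~500 frames would need on the order of 125 such layers, whereas a
# single self-attention layer already relates every frame to every other frame.
```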

To enhance a model’s ability to capture contextual information and intra-relations among frames, transformers have been widely discussed in the fields of computer vision (Dosovitskiy et al., 2020) and natural language processing (Wu et al., 2020a), and their attention distance has also been explored. However, few studies have investigated the capabilities of transformers for speech generation or conversion tasks. To address these problems, the contributions of this study are as follows:

  • We proposed a CycleGAN-based model with the transformer and investigated its ability in the EVC task; we called our model CycleTransGAN.

  • To enhance the model’s ability to convert emotional voices, we adopted curriculum learning to gradually increase the frame length during training (a minimal sketch of such a length schedule is given after this list). Furthermore, a fine-grained level discriminator was designed to judge how close each segment is to the real samples.

  • The proposed method was evaluated on the Japanese emotional speech dataset and the Emotional Speech Dataset (ESD). The model was compared with widely used baselines (i.e., ACVAE (Kameoka et al., 2019) and CycleGAN (Zhou et al., 2020a)) as well as different configurations of the proposed model.

  • We discussed the proposed model’s temporal dependency augmented by the transformer.
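
As a concrete illustration of the curriculum mentioned above, the following sketch shows one way the input frame length could be scheduled during training. The stage boundaries and segment lengths are assumptions for illustration, not the exact schedule used in our experiments.

```python
import random

# Hypothetical curriculum over input frame lengths: short segments first,
# whole utterances later. Epoch boundaries and lengths are illustrative.
CURRICULUM = [
    (0,   64),    # from epoch 0: 64-frame segments
    (50,  128),   # from epoch 50: 128-frame segments
    (100, 256),   # from epoch 100: 256-frame segments
    (150, None),  # from epoch 150: whole utterances
]

def segment_length_for_epoch(epoch):
    length = CURRICULUM[0][1]
    for start_epoch, seg_len in CURRICULUM:
        if epoch >= start_epoch:
            length = seg_len
    return length

def crop_utterance(features, epoch):
    """Crop a (frames, dims) feature sequence to the current curriculum length."""
    seg_len = segment_length_for_epoch(epoch)
    if seg_len is None or len(features) <= seg_len:
        return features
    start = random.randint(0, len(features) - seg_len)
    return features[start:start + seg_len]
```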

The remainder of this paper is structured as follows. Related works are introduced in Section 2. The proposed method is described in detail in Section 3. The experiment and results are described in Section 4. Discussions are presented in Section 5, and the final Section 6 briefly summarizes our work.

Transformer

The transformer was first proposed by Vaswani et al. (2017); it adopts the attention mechanism to weigh the significance of each portion of the input data. Like recurrent neural networks (RNNs), the transformer was designed to process sequential data, for tasks such as translation (Conneau and Lample, 2019), language understanding (Wang et al., 2018), and text classification (Devlin et al., 2018, Yang et al., 2019) in natural language processing. However, unlike RNNs, there is no need to process the data
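
The core of the mechanism described above is scaled dot-product attention, in which every frame of the input sequence is related to every other frame within a single layer. The following minimal single-head sketch is illustrative; the projection sizes and the toy input are assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (frames, dims) sequence.
    Each output frame is a weighted sum of all input frames, so the temporal
    dependency spans the whole segment after one layer."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)  # attention weights between every pair of frames
    return weights @ v

# Toy usage with random acoustic features (200 frames, 36 dimensions):
x = torch.randn(200, 36)
w_q, w_k, w_v = (torch.randn(36, 36) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([200, 36])
```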

Novel designs

In this study, we introduce three novel designs to achieve better performance from the model. The following subsections describe them in detail.

Dataset

In this study, we used the Japanese emotional speech dataset (Asai et al., 2020) and the Emotional Speech Dataset (ESD; Zhou et al., 2022), both of which contain happy, angry, sad, and neutral utterances.

Japanese emotional speech dataset: each category of this dataset has 1070 utterances in total. We used 1000 utterances in the training phase, of which 50 were assigned to the validation set, and the remaining 70 utterances were used for testing. The duration of each emotion is presented in Table 1. In

Discussion

Conclusion

In this paper, we proposed a CycleGAN-based emotional VC model with a transformer module, called CycleTransGAN. The model was evaluated on two datasets: the Japanese emotional speech dataset and the Emotional Speech Dataset (ESD). Taking advantage of the transformer, the model can take contextual information into account over a wider range. This allows the generated speech to be more consistent in terms of temporal features, thereby improving the quality and naturalness of the converted

CRediT authorship contribution statement

Changzeng Fu: Conceptualization, Methodology, Implementation, Evaluation. Chaoran Liu: Conceptualization, Methodology. Carlos Toshinori Ishi: Conceptualization. Hiroshi Ishiguro: Conceptualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Grant-in-Aid for Scientific Research on Innovative Areas JP20H05576 (model training) and by JST, Moonshot R&D under Grant JPMJMS2011 (model evaluation).

References (48)

  • Morise, M., 2015. CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Commun.
  • Zhou, K., et al., 2022. Emotional voice conversion: Theory, databases and ESD. Speech Commun.
  • Aihara, R., et al., 2012. GMM-based emotional voice conversion using spectrum and prosody features. Am. J. Signal Process.
  • Asai, S., Yoshino, K., Shinagawa, S., Sakti, S., Nakamura, S., 2020. Emotional speech corpus for persuasive dialogue...
  • Cao, Y., et al. Nonparallel emotional speech conversion using VAE-GAN.
  • Chella, A., et al. An emotional storyteller robot.
  • Chen, X., et al., 2021. When vision transformers outperform ResNets without pretraining or strong data augmentations.
  • Choi, H., et al., 2021. Sequence-to-sequence emotional voice conversion with strength control. IEEE Access.
  • Conneau, A., et al., 2019. Cross-lingual language model pretraining. Adv. Neural Inf. Process. Syst.
  • Devlin, J., et al., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Ding, S., et al. Group latent embedding for vector quantized variational autoencoder in non-parallel voice conversion.
  • Dong, L., et al. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition.
  • Dosovitskiy, A., et al., 2020. An image is worth 16x16 words: Transformers for image recognition at scale.
  • Gao, J., et al., 2018. Nonparallel emotional speech conversion.
  • Gulati, A., et al., 2020. Conformer: Convolution-augmented transformer for speech recognition.
  • Helander, E., et al., 2010. Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process.
  • Kameoka, H., et al., 2019. ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Kameoka, H., et al., 2020. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Kaneko, T., et al. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks.
  • Kawanami, H., et al., 2003. GMM-based voice conversion applied to emotional speech synthesis.
  • Kim, T.-H., et al. Emotional voice conversion using multitask learning with text-to-speech.
  • Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., 2019. Neural speech synthesis with transformer network. In: Proceedings of...
  • Li, N., Liu, Y., Wu, Y., Liu, S., Zhao, S., Liu, M., 2020. Robutrans: A robust transformer-based text-to-speech model....
  • Liu, S., et al., 2020. Emotional voice conversion with cycle-consistent adversarial network.