Speech Communication

Volume 132, September 2021, Pages 132-145

Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation

https://doi.org/10.1016/j.specom.2021.07.001

Open access under a Creative Commons license

Highlights

  • Deep Gaussian processes are effective in multi-speaker text-to-speech synthesis.

  • A deep Gaussian process with one-hot speaker codes outperforms a deep neural network.

  • Learning latent speaker representations improves speech quality with scarce data.

  • The learned speaker space can be used to generate voices of non-existent speakers.

Abstract

This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker speech synthesis and speaker representation learning. A DGP has a deep architecture of Bayesian kernel regression, and DGP-based single-speaker speech synthesis has been reported to outperform deep neural network (DNN)-based synthesis in the framework of statistical parametric speech synthesis. By extending this method to multiple speakers, higher speech quality is expected to be achievable with fewer training utterances from each speaker. To apply DGPs to multi-speaker speech synthesis, we propose two methods: one using a DGP with one-hot speaker codes, and the other using a deep Gaussian process latent variable model (DGPLVM). The DGP with one-hot speaker codes uses additional GP layers to transform speaker codes into latent speaker representations. The DGPLVM directly models the distribution of latent speaker representations and learns it jointly with the acoustic model parameters. In the DGPLVM-based method, acoustic speaker similarity is expressed as similarity between speaker representations, so the voices of similar speakers are modeled efficiently. We experimentally evaluated the proposed methods against conventional DNN-based and variational autoencoder (VAE)-based frameworks in terms of acoustic feature distortion and subjective speech quality. The experimental results demonstrate that (1) the proposed DGP-based and DGPLVM-based methods improve subjective speech quality compared with a feed-forward DNN-based method, (2) even when the amount of training data for target speakers is limited, the DGPLVM-based method outperforms the other methods, including the VAE-based one, and (3) by using a speaker representation randomly sampled from the learned speaker space, the DGPLVM-based method can generate voices of non-existent speakers.
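The construction described in the abstract — linguistic features augmented with one-hot speaker codes, then passed through stacked GP regression layers — can be illustrated with a toy, mean-only sketch. This is not the authors' implementation: the feature dimensions, kernel hyperparameters, and data below are illustrative assumptions, and a true DGP would also propagate uncertainty through the layers rather than only posterior means.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between the rows of X1 and X2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_layer(X_train, Y_train, X_test, noise=1e-2):
    # Posterior mean of GP regression: one "layer" of the stacked model.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    return Ks @ np.linalg.solve(K, Y_train)

# Toy data: 2-dim "linguistic" features for 2 speakers, 10 frames each.
rng = np.random.default_rng(0)
n_spk = 2
ling = rng.normal(size=(20, 2))              # stand-in linguistic features
spk = np.repeat(np.eye(n_spk), 10, axis=0)   # one-hot speaker codes
X = np.hstack([ling, spk])                   # features + speaker code as input

# Toy acoustic target: shared structure plus a per-speaker offset.
Y = np.sin(ling[:, :1]) + spk[:, :1]

# Two stacked GP layers: the first layer's output feeds the second.
H = gp_layer(X, Y, X)      # hidden (latent) representation
pred = gp_layer(H, Y, H)   # final acoustic prediction, shape (20, 1)
```

The key point of the sketch is the input layout: the speaker identity enters only as a one-hot code concatenated to the frame-level features, and the kernel layers are free to map that code into a latent representation shared across frames of the same speaker.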

Keywords

Text-to-speech synthesis
Multi-speaker modeling
Speaker representation
Gaussian process
Deep generative models
