Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders

Okamoto, Takuma; Toda, Tomoki; Shiga, Yoshinori; Kawai, Hisashi

doi:10.21437/Interspeech.2019-1288

This paper investigates real-time high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoders, WaveGlow is introduced and a single Gaussian WaveRNN (SG-WaveRNN) is proposed. The proposed SG-WaveRNN predicts continuous-valued speech waveforms in half the synthesis time of vanilla WaveRNN with dual softmax for 16-bit audio prediction. Additionally, a sequence-to-sequence (seq2seq) acoustic model (AM) for pitch-accent languages, such as Japanese, is investigated by introducing the Tacotron 2 architecture. In the seq2seq AM, full-context labels extracted by a text analyzer are used as input and are directly converted into mel-spectrograms. The results of a subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping can synthesize high-quality speech waveforms and that real-time high-fidelity neural TTS systems can be realized with the seq2seq AM and WaveGlow or SG-WaveRNN vocoders. In particular, the seq2seq AM and the WaveGlow vocoder conditioned on mel-spectrograms, with simple PyTorch implementations, achieve real-time factors of 0.06 and 0.10 for inference on a GPU.
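For readers unfamiliar with the metric, a real-time factor (RTF) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal sketch of that calculation, not the authors' measurement code. The function names, the 24 kHz sampling rate, and the dummy synthesizer are hypothetical choices made only for illustration.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, sample_rate):
    """Time one call to `synthesize` (hypothetical: returns a sequence of samples)."""
    start = time.perf_counter()
    waveform = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)


if __name__ == "__main__":
    # Assumed sampling rate, used only for this illustration.
    sample_rate = 24000
    # Example: 0.6 s of compute for 10 s of audio gives an RTF of about 0.06.
    print(real_time_factor(0.6, 10 * sample_rate, sample_rate))
    # Dummy "synthesizer" that returns one second of silence.
    print(measure_rtf(lambda: [0.0] * sample_rate, sample_rate))
```

When timing GPU inference, asynchronous kernels should be allowed to finish before the clock is stopped (in PyTorch, by calling `torch.cuda.synchronize()`); otherwise the measured time can understate the true cost.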

Cite as: Okamoto, T., Toda, T., Shiga, Y., Kawai, H. (2019) Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders. Proc. Interspeech 2019, 1308-1312, doi: 10.21437/Interspeech.2019-1288

@inproceedings{okamoto19_interspeech,
  author={Takuma Okamoto and Tomoki Toda and Yoshinori Shiga and Hisashi Kawai},
  title={{Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1308--1312},
  doi={10.21437/Interspeech.2019-1288}
}