ISCA Archive Interspeech 2021

TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions

Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang

The combination of the recently proposed LPCNet vocoder with a sequence-to-sequence acoustic model such as Tacotron has produced lightweight speech synthesis systems. However, the quality of the synthesized speech is often unstable because the acoustic model does not predict the pitch parameters precisely enough, especially for some tonal languages like Chinese and Japanese. In this paper, we propose TacoLPCNet, an end-to-end speech synthesis system that conditions LPCNet on Mel spectrogram predictions. First, we extend LPCNet to take the Mel spectrogram as input instead of explicit pitch information and a pitch-related network. Furthermore, we optimize the system through model pruning, multi-frame inference, and a longer frame length so that it meets the requirements of real-time applications. Objective and subjective evaluations on several languages show that the proposed system, with the proposed optimization strategies, is more stable for tonal languages. The experiments also show that our model synthesizes speech 3.12 times faster than the baseline on a standard CPU while maintaining naturalness.
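As a rough illustration of why a longer frame length and a faster vocoder matter for real-time use, the arithmetic can be sketched as follows. Only the 3.12 speedup figure comes from the abstract; the 16 kHz sample rate, the 160- and 320-sample frame lengths, the 0.9 baseline real-time factor, and all function names are assumptions chosen for illustration, not values or code from the paper.

```python
def frames_per_second(sample_rate_hz: int, frame_length_samples: int) -> float:
    """Number of frame-level network updates needed per second of audio."""
    return sample_rate_hz / frame_length_samples


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds


def sped_up_runtime(baseline_seconds: float, speedup: float) -> float:
    """Runtime after applying a multiplicative speedup to a baseline runtime."""
    return baseline_seconds / speedup


# Assumed values: 16 kHz audio with 160-sample vs. 320-sample frames.
short_frames = frames_per_second(16000, 160)  # frame updates per second of audio
long_frames = frames_per_second(16000, 320)   # doubling the frame length halves them

# Hypothetical baseline that needs 0.9 s of compute per 1 s of audio,
# combined with the 3.12x speedup reported in the abstract.
baseline_rtf = real_time_factor(0.9, 1.0)
optimized_rtf = real_time_factor(sped_up_runtime(0.9, 3.12), 1.0)

print(short_frames, long_frames, baseline_rtf, optimized_rtf)
```

Under these assumed numbers, doubling the frame length halves the number of frame-level updates per second of audio, and a 3.12 times speedup moves a near-real-time baseline comfortably below a real-time factor of 1.0.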


doi: 10.21437/Interspeech.2021-852

Cite as: Gong, C., Wang, L., Zhang, J., Guo, S., Wang, Y., Dang, J. (2021) TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions. Proc. Interspeech 2021, 111-115, doi: 10.21437/Interspeech.2021-852

@inproceedings{gong21_interspeech,
  author={Cheng Gong and Longbiao Wang and Ju Zhang and Shaotong Guo and Yuguang Wang and Jianwu Dang},
  title={{TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={111--115},
  doi={10.21437/Interspeech.2021-852}
}