A Unified Framework for the Generation of Glottal Signals in Deep Learning-based Parametric Speech Synthesis Systems

Hwang, Min-Jae; Song, Eunwoo; Kim, Jin-Seob; Kang, Hong-Goo

doi:10.21437/Interspeech.2018-1590

In this paper, we propose a unified training framework for generating glottal signals in deep learning (DL)-based parametric speech synthesis systems. Glottal vocoding-based speech synthesis, and in particular the modeling-by-generation (MbG) structure that we recently proposed, significantly improves the naturalness of synthesized speech by using an additional DL structure to faithfully represent the noise component of the glottal excitation. However, because the MbG method introduces a multistage processing pipeline, its training process is complicated and inefficient. To alleviate this problem, we propose a unified training approach that generates speech parameters directly by merging all the required models, such as the acoustic, glottal, and noise models, into a single network. Because noise analysis must be performed after the glottal model has been trained, we also propose a stochastic noise analysis method that brings noise modeling into the unified training process by iteratively analyzing the noise component in every epoch. Objective and subjective test results both verify the superiority of the proposed algorithm over conventional methods.
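The abstract does not specify the network architecture or the details of the stochastic noise analysis, so the following is only a toy sketch of the general idea of re-estimating the noise target in every epoch while training everything jointly. The function name `train_unified`, the use of two linear models in place of deep networks, and the log-energy noise target are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def train_unified(features, glottal, epochs=200, lr=0.1):
    """Toy joint training loop (illustrative assumption, not the paper's method).

    A linear "glottal model" predicts the deterministic part of the target,
    and a linear "noise model" predicts the log energy of whatever the
    glottal model currently fails to explain. The noise target is
    recomputed in every epoch from the current glottal prediction.
    """
    n, d = features.shape
    w_glottal = np.zeros(d)  # weights of the toy glottal model
    w_noise = np.zeros(d)    # weights of the toy noise model
    history = []             # mean squared residual per epoch

    for _ in range(epochs):
        # Per-epoch noise analysis: residual of the *current* glottal model.
        pred = features @ w_glottal
        residual = glottal - pred
        noise_target = np.log(residual ** 2 + 1e-8)

        # Gradients of the two mean-squared-error losses.
        grad_glottal = features.T @ (pred - glottal) / n
        noise_pred = features @ w_noise
        grad_noise = features.T @ (noise_pred - noise_target) / n

        # Both models are updated in the same pass, with no separate stages.
        w_glottal -= lr * grad_glottal
        w_noise -= lr * grad_noise
        history.append(float(np.mean(residual ** 2)))

    return w_glottal, w_noise, history
```

In a multistage pipeline, the noise target would be computed once, after the glottal model had finished training. In this sketch the target moves as the glottal model improves, which is the property the abstract attributes to the unified approach; the stochastic aspect of the proposed analysis is not modeled here.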

Cite as: Hwang, M.-J., Song, E., Kim, J.-S., Kang, H.-G. (2018) A Unified Framework for the Generation of Glottal Signals in Deep Learning-based Parametric Speech Synthesis Systems. Proc. Interspeech 2018, 912-916, doi: 10.21437/Interspeech.2018-1590

@inproceedings{hwang18_interspeech,
  author={Min-Jae Hwang and Eunwoo Song and Jin-Seob Kim and Hong-Goo Kang},
  title={{A Unified Framework for the Generation of Glottal Signals in Deep Learning-based Parametric Speech Synthesis Systems}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={912--916},
  doi={10.21437/Interspeech.2018-1590}
}