Model Integration for HMM- and DNN-Based Speech Synthesis Using Product-of-Experts Framework

Tachibana, Kentaro; Toda, Tomoki; Shiga, Yoshinori; Kawai, Hisashi

doi:10.21437/Interspeech.2016-1006

In this paper, we propose a model integration method for hidden Markov model (HMM) and deep neural network (DNN) based acoustic models using a product-of-experts (PoE) framework in statistical parametric speech synthesis. In speech parameter generation, the DNN predicts the mean vector of the probability density function of the speech parameters frame by frame, while its covariance matrix is kept constant over all frames. The HMM, on the other hand, predicts the covariance matrix as well as the mean vector, but both are fixed within each HMM state, i.e., they can vary only state by state. To predict a better probability density function by leveraging the advantages of the individual models, the proposed method integrates the DNN and the HMM as a PoE, generating a new probability density function that satisfies the conditions of both. Furthermore, we propose a method for jointly optimizing the DNN and the HMM within the PoE framework by effectively using additional latent variables. We conducted objective and subjective evaluations, which demonstrate that the proposed method significantly outperforms both DNN-based and HMM-based speech synthesis.
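To illustrate the basic idea of combining two Gaussian experts, the following is a minimal sketch, not the authors' implementation. It assumes diagonal covariances and uses only the standard result that a normalized product of two Gaussians is again Gaussian, with precisions that add and a precision-weighted mean. The function name `poe_gaussian`, the toy values, and the frame-to-state alignment are all hypothetical; the joint optimization with latent variables described in the abstract is not covered.

```python
import numpy as np

def poe_gaussian(mu_dnn, var_dnn, mu_hmm, var_hmm):
    """Normalized product of two diagonal Gaussians (hypothetical helper).

    Precisions add; the mean is the precision-weighted average of the
    two expert means.
    """
    prec_dnn = 1.0 / var_dnn
    prec_hmm = 1.0 / var_hmm
    var_poe = 1.0 / (prec_dnn + prec_hmm)
    mu_poe = var_poe * (prec_dnn * mu_dnn + prec_hmm * mu_hmm)
    return mu_poe, var_poe

# Toy example (assumed values): 4 frames, 1-dimensional speech parameter.
# DNN expert: a mean per frame, one variance shared by all frames.
mu_dnn = np.array([[0.0], [1.0], [2.0], [3.0]])
var_dnn = np.array([1.0])

# HMM expert: a mean and variance per state, constant within a state.
state_of_frame = np.array([0, 0, 1, 1])   # assumed frame-to-state alignment
hmm_means = np.array([[0.5], [2.5]])
hmm_vars = np.array([[1.0], [0.25]])
mu_hmm = hmm_means[state_of_frame]        # expand state values to frames
var_hmm = hmm_vars[state_of_frame]

mu_poe, var_poe = poe_gaussian(mu_dnn, var_dnn, mu_hmm, var_hmm)
print(mu_poe.ravel())   # combined mean now varies frame by frame
print(var_poe.ravel())  # combined variance varies state by state
```

In this toy setting the combined mean follows the DNN's frame-level trajectory but is pulled toward the HMM state mean, more strongly in the state where the HMM variance is small. The combined variance is no longer constant over all frames, since it inherits the state dependence of the HMM expert.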

Cite as: Tachibana, K., Toda, T., Shiga, Y., Kawai, H. (2016) Model Integration for HMM- and DNN-Based Speech Synthesis Using Product-of-Experts Framework. Proc. Interspeech 2016, 2288-2292, doi: 10.21437/Interspeech.2016-1006

@inproceedings{tachibana16_interspeech,
  author={Kentaro Tachibana and Tomoki Toda and Yoshinori Shiga and Hisashi Kawai},
  title={{Model Integration for HMM- and DNN-Based Speech Synthesis Using Product-of-Experts Framework}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={2288--2292},
  doi={10.21437/Interspeech.2016-1006}
}