ISCA Archive Interspeech 2021

UnitNet-Based Hybrid Speech Synthesis

Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai

This paper presents a hybrid speech synthesis method based on UnitNet, a unified sequence-to-sequence (Seq2Seq) acoustic model for both statistical parametric speech synthesis (SPSS) and concatenative speech synthesis (CSS). The method combines the CSS and SPSS approaches to synthesize different segments within an utterance. Compared with the Tacotron2 model for Seq2Seq speech synthesis, UnitNet utilizes the phone boundaries of the training data, and its decoder contains autoregressive structures at both the phone and frame levels. This hierarchical architecture can not only extract embedding vectors to represent phone-sized units in the corpus but also measure the dependency among consecutive units, which makes UnitNet capable of guiding the selection of phone-sized units for CSS. Furthermore, hybrid synthesis can be achieved by integrating units generated by SPSS into the CSS framework for target phones that lack appropriate candidates in the corpus. Experimental results show that UnitNet achieves naturalness comparable to Tacotron2 for SPSS and outperforms our previous Tacotron-based method for CSS. Moreover, the naturalness and inference efficiency of SPSS can be further improved through hybrid synthesis.


doi: 10.21437/Interspeech.2021-1092

Cite as: Zhou, X., Ling, Z.-H., Dai, L.-R. (2021) UnitNet-Based Hybrid Speech Synthesis. Proc. Interspeech 2021, 4119-4123, doi: 10.21437/Interspeech.2021-1092

@inproceedings{zhou21f_interspeech,
  author={Xiao Zhou and Zhen-Hua Ling and Li-Rong Dai},
  title={{UnitNet-Based Hybrid Speech Synthesis}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4119--4123},
  doi={10.21437/Interspeech.2021-1092}
}