Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work.
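The abstract mentions mel-cepstral distortion (MCD) as an objective evaluation measure. As a reference point, the following is a minimal sketch of the standard frame-level MCD formula (the common definition with the 10√2/ln 10 scaling, excluding the 0th energy coefficient); the paper's exact variant, including how reference and synthesized frames are aligned, is not specified in the abstract:

```python
import math

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-level mel-cepstral distortion in dB between a reference and a
    synthesized mel-cepstral coefficient vector.

    Uses the common definition MCD = (10 / ln 10) * sqrt(2 * sum_d (diff_d)^2),
    summing over coefficients 1..D and skipping the 0th (energy) term.
    Whether the paper uses exactly this variant is an assumption.
    """
    if len(mc_ref) != len(mc_syn):
        raise ValueError("coefficient vectors must have the same length")
    # Squared differences over all coefficients except the 0th.
    sq_sum = sum((r - s) ** 2 for r, s in zip(mc_ref[1:], mc_syn[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq_sum)
```

In practice the per-frame values are averaged over an utterance after time-aligning the two coefficient sequences (often via dynamic time warping); identical frames yield an MCD of 0 dB.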
Cite as: Lee, K.-Z., Cooper, E., Hirschberg, J. (2018) A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis. Proc. Interspeech 2018, 2873-2877, doi: 10.21437/Interspeech.2018-1313
@inproceedings{lee18b_interspeech,
  author={Kai-Zhan Lee and Erica Cooper and Julia Hirschberg},
  title={{A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis}},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2873--2877},
  doi={10.21437/Interspeech.2018-1313}
}