Adaptive Text to Speech for Spontaneous Style

Yan, Yuzi; Tan, Xu; Li, Bohan; Zhang, Guangyan; Qin, Tao; Zhao, Sheng; Shen, Yuan; Zhang, Wei-Qiang; Liu, Tie-Yan

doi:10.21437/Interspeech.2021-584

While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly for two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty of modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to the timbre of other speakers, we fine-tune some parameters in the decoder with a small amount of speech data. To address the lack of training data, we mine a spontaneous speech dataset to support this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
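The mixture-of-experts duration predictor mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that there are three experts for fast, medium and slow speech; everything else here is an assumption made for illustration, not the authors' implementation: the softmax gate, the linear experts, the class name `MoEDurationPredictor`, and the choice to predict a duration in frames directly from a phoneme's hidden vector.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights: Sequence[float], features: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(w * x for w, x in zip(weights, features))


class MoEDurationPredictor:
    """Hypothetical three-expert duration predictor (fast, medium, slow).

    Each expert is a linear map from a hidden vector to a duration;
    a softmax gate (an assumption, not stated in the abstract) mixes them.
    """

    def __init__(
        self,
        expert_weights: Sequence[Sequence[float]],
        expert_biases: Sequence[float],
        gate_weights: Sequence[Sequence[float]],
    ) -> None:
        self.expert_weights = expert_weights
        self.expert_biases = expert_biases
        self.gate_weights = gate_weights

    def predict(self, hidden: Sequence[float]) -> Tuple[float, List[float]]:
        # Gate: one score per expert, normalised to mixing weights.
        gate = softmax([dot(g, hidden) for g in self.gate_weights])
        # Experts: each proposes its own duration for this phoneme.
        proposals = [
            dot(w, hidden) + b
            for w, b in zip(self.expert_weights, self.expert_biases)
        ]
        # Output: gate-weighted combination of the expert proposals.
        duration = sum(g * d for g, d in zip(gate, proposals))
        return duration, gate


# Toy usage: zero gate weights give a uniform mixture of the three experts.
predictor = MoEDurationPredictor(
    expert_weights=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    expert_biases=[2.0, 5.0, 8.0],  # fast, medium, slow (frames, illustrative)
    gate_weights=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
)
duration, gate = predictor.predict([1.0, 2.0])
```

In this toy setup the gate carries no information, so the prediction is simply the average of the three expert biases. In a trained model the gate would shift weight toward the expert that matches the local speaking rate.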

Cite as: Yan, Y., Tan, X., Li, B., Zhang, G., Qin, T., Zhao, S., Shen, Y., Zhang, W.-Q., Liu, T.-Y. (2021) Adaptive Text to Speech for Spontaneous Style. Proc. Interspeech 2021, 4668-4672, doi: 10.21437/Interspeech.2021-584

@inproceedings{yan21d_interspeech,
  author={Yuzi Yan and Xu Tan and Bohan Li and Guangyan Zhang and Tao Qin and Sheng Zhao and Yuan Shen and Wei-Qiang Zhang and Tie-Yan Liu},
  title={{Adaptive Text to Speech for Spontaneous Style}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4668--4672},
  doi={10.21437/Interspeech.2021-584}
}