Automatically dubbing the speech of a video involves: (i) segmenting the target sentences into phrases to reflect the speech-pause arrangement of the original speaker, and (ii) adjusting the speaking rate of the synthetic voice at the phrase level to match the exact timing of each corresponding source phrase. In this work, we investigate a post-segmentation approach that controls the speaking rate of neural Text-to-Speech (TTS) at the phrase level after generating the entire sentence. Our post-segmentation method relies on the attention matrix produced by the context-generation step to perform a forced alignment over pause markers inserted in the input text. We show that: (i) our approach can be more accurate than applying an off-the-shelf forced aligner, and (ii) the post-segmentation method permits generating more fluent speech than the pre-segmentation approach described in [1].
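The core idea of attention-based segmentation can be sketched as follows: given the decoder-to-encoder attention matrix produced during synthesis, the decoder frame that attends most strongly to each pause-marker token approximates a phrase boundary in the generated audio. This is a minimal illustration, not the paper's implementation; the function name, the (frames × tokens) matrix layout, and the argmax heuristic are assumptions for the sake of the example.

```python
import numpy as np

def segment_by_pause_markers(attention, marker_positions):
    """Estimate phrase boundaries from a TTS attention matrix.

    attention:        (num_frames, num_tokens) array of attention weights
                      from decoding (layout assumed for this sketch).
    marker_positions: indices of pause-marker tokens in the input text.

    Returns, for each marker, the decoder frame whose attention peaks on
    that marker token, i.e. an approximate phrase-boundary frame.
    """
    boundaries = []
    for pos in marker_positions:
        # Frame most aligned with this pause-marker token.
        boundaries.append(int(np.argmax(attention[:, pos])))
    return boundaries

# Toy monotonic alignment: 6 decoder frames over 4 input tokens,
# with a pause marker at token index 2.
att = np.zeros((6, 4))
for frame, token in enumerate([0, 0, 1, 2, 3, 3]):
    att[frame, token] = 1.0
print(segment_by_pause_markers(att, [2]))
```

With per-boundary frames in hand, each phrase's duration can be compared against the corresponding source phrase to derive a phrase-level speaking-rate adjustment.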
Cite as: Sharma, M., Virkar, Y., Federico, M., Barra-Chicote, R., Enyedi, R. (2021) Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing. Proc. Interspeech 2021, 3151-3155, doi: 10.21437/Interspeech.2021-1012
@inproceedings{sharma21b_interspeech,
  author={Mayank Sharma and Yogesh Virkar and Marcello Federico and Roberto Barra-Chicote and Robert Enyedi},
  title={{Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3151--3155},
  doi={10.21437/Interspeech.2021-1012}
}