Automatically dubbing the speech of a video involves: (i) segmenting the target sentences into phrases to reflect the speech-pause arrangement of the original speaker, and (ii) adjusting the speaking rate of the synthetic voice at the phrase level to match the exact timing of each corresponding source phrase. In this work, we investigate a post-segmentation approach that controls the speaking rate of neural Text-to-Speech (TTS) at the phrase level after generating the entire sentence. Our post-segmentation method relies on the attention matrix produced by the context-generation step to perform a forced alignment over pause markers inserted in the input text. We show that: (i) our approach can be more accurate than applying an off-the-shelf forced aligner, and (ii) the post-segmentation method permits generating more fluent speech than the pre-segmentation approach described in [1].
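The core idea of attention-based segmentation can be sketched as follows: given the decoder-to-encoder attention matrix produced during synthesis, the decoder frame that attends most strongly to each pause-marker token approximates a phrase boundary in the generated audio. This is a minimal illustration, not the paper's implementation; the function name, the (frames × tokens) matrix layout, and the argmax heuristic are assumptions for the sake of the example.

```python
import numpy as np

def segment_by_pause_markers(attention, marker_positions):
    """Estimate phrase boundaries from a TTS attention matrix.

    attention:        (num_frames, num_tokens) array of attention weights
                      from decoding (layout assumed for this sketch).
    marker_positions: indices of pause-marker tokens in the input text.

    Returns, for each marker, the decoder frame whose attention peaks on
    that marker token, i.e. an approximate phrase-boundary frame.
    """
    boundaries = []
    for pos in marker_positions:
        # Frame most aligned with this pause-marker token.
        boundaries.append(int(np.argmax(attention[:, pos])))
    return boundaries

# Toy monotonic alignment: 6 decoder frames over 4 input tokens,
# with a pause marker at token index 2.
att = np.zeros((6, 4))
for frame, token in enumerate([0, 0, 1, 2, 3, 3]):
    att[frame, token] = 1.0
print(segment_by_pause_markers(att, [2]))
```

With per-boundary frames in hand, each phrase's duration can be compared against the corresponding source phrase to derive a phrase-level speaking-rate adjustment.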
Cite as: Sharma, M., Virkar, Y., Federico, M., Barra-Chicote, R., Enyedi, R. (2021) Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing. Proc. Interspeech 2021, 3151-3155, doi: 10.21437/Interspeech.2021-1012
@inproceedings{sharma21b_interspeech,
  author={Mayank Sharma and Yogesh Virkar and Marcello Federico and Roberto Barra-Chicote and Robert Enyedi},
  title={{Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3151--3155},
  doi={10.21437/Interspeech.2021-1012}
}