Word emphasis prediction is an important part of expressive prosody generation in modern Text-To-Speech (TTS) systems. We present a method for predicting emphasized words for expressive TTS, based on a Deep Neural Network (DNN). We show that the presented method outperforms machine learning methods based on hand-crafted features in terms of objective metrics such as precision and recall. Using a listening test, we further demonstrate that the contribution of the predicted emphasized words to the expressiveness of the synthesized speech is subjectively perceivable.
Cite as: Mass, Y., Shechtman, S., Mordechay, M., Hoory, R., Sar Shalom, O., Lev, G., Konopnicki, D. (2018) Word Emphasis Prediction for Expressive Text to Speech. Proc. Interspeech 2018, 2868-2872, doi: 10.21437/Interspeech.2018-1159
@inproceedings{mass18_interspeech,
  author={Yosi Mass and Slava Shechtman and Moran Mordechay and Ron Hoory and Oren {Sar Shalom} and Guy Lev and David Konopnicki},
  title={{Word Emphasis Prediction for Expressive Text to Speech}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2868--2872},
  doi={10.21437/Interspeech.2018-1159}
}