
Improving Unsupervised Style Transfer in End-to-End Speech Synthesis with End-to-End Speech Recognition



Abstract:

An end-to-end TTS model can directly take an utterance as a reference and generate speech from input text with prosody and speaker characteristics similar to those of the reference utterance. Ideally, the transcription of the reference utterance does not need to match the text to be synthesized, so unsupervised style transfer can be achieved. However, because previous models were trained only on matched text and speech, giving them unmatched text and speech at test time makes them synthesize blurry speech. In this paper, we propose to mitigate this problem by using unmatched text and speech during training, and by using the recognition accuracy of an end-to-end ASR model to guide the training procedure. The experimental results show that with the guidance of end-to-end ASR, both the ASR accuracy (objective evaluation) and the listener preference (subjective evaluation) of the speech generated by the TTS model are improved. Moreover, we propose an attention consistency loss as regularization, which is shown to accelerate training.
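The abstract describes the training recipe only at a high level. Below is a minimal PyTorch-style sketch of how such a combined objective could look: the function and argument names (tts_model, asr_model, ref_mel), the use of CTC as the ASR guidance loss, and the L1 attention-consistency term are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def style_transfer_step(tts_model, asr_model, text, text_lens, mel, ref_mel,
                        lambda_asr: float = 1.0, mu_attn: float = 0.1):
    """One training step combining a matched and an unmatched pair.

    text / text_lens : token ids of the text to synthesize, and their lengths
    mel              : ground-truth mel spectrogram whose transcription is `text`
    ref_mel          : reference utterance whose transcription need NOT match `text`
    """
    # Matched pair: ordinary TTS reconstruction loss on the mel frames.
    mel_matched, attn_matched = tts_model(text, reference=mel)
    loss_tts = F.l1_loss(mel_matched, mel)

    # Unmatched pair: no ground-truth mel exists, so a frozen end-to-end ASR
    # model judges whether the synthesized speech still carries the input text.
    mel_unmatched, attn_unmatched = tts_model(text, reference=ref_mel)
    log_probs = asr_model(mel_unmatched)  # assumed shape: (T, batch, vocab)
    input_lens = torch.full((log_probs.size(1),), log_probs.size(0),
                            dtype=torch.long)
    loss_asr = F.ctc_loss(log_probs, text, input_lens, text_lens)

    # Attention consistency regularizer: pull the text-to-speech alignment of
    # the unmatched pass toward the (detached) alignment of the matched pass.
    # NOTE: a simple L1 stand-in; the paper's exact formulation may differ.
    loss_attn = F.l1_loss(attn_unmatched, attn_matched.detach())

    return loss_tts + lambda_asr * loss_asr + mu_attn * loss_attn
```

Keeping the ASR parameters frozen means the guidance loss back-propagates only through the synthesized spectrogram, steering the TTS model toward intelligible output rather than degrading the recognizer.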
Date of Conference: 18-21 December 2018
Date Added to IEEE Xplore: 14 February 2019
Conference Location: Athens, Greece

