Learning Contrastive Emotional Nuances in Speech Synthesis


Abstract:

Prosody is a crucial speech feature in emotional text-to-speech (TTS), as different emotions have distinct prosodic characteristics. Existing works in emotional TTS have primarily used the emotion labels in the dataset by applying an auxiliary emotion classification loss to enhance emotional nuances in the model. However, this approach may only partially exploit the potential of emotion labels. Accordingly, this paper proposes a supervised contrastive approach that uses emotion labels effectively and enables the model to distinguish the prosody of different emotions. This work also explores unsupervised contrastive learning for cases where emotion labels are missing in emotional TTS. In particular, the proposed TTS architecture ensures cross-speaker emotion transfer, allowing accurate speech generation even without specific prosody from a target speaker. The experimental results on emotional datasets demonstrate the effectiveness of the proposed method.
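
The abstract does not give the loss formulation, so the following is only an illustrative sketch of a generic supervised contrastive loss, not the paper's objective. It assumes each utterance has already been mapped to a non-zero prosody embedding and carries an emotion label. The function names, the toy embeddings, and the temperature of 0.1 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical helper, not from the paper).

    For each anchor, samples sharing its emotion label are positives and all
    other samples are negatives. Anchors with no positive are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(s - max_logit) for s in logits.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D "prosody embeddings" (made-up values for illustration only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(
    toy_embeddings, ["happy", "happy", "sad", "sad"]
)
misaligned_loss = supervised_contrastive_loss(
    toy_embeddings, ["happy", "sad", "happy", "sad"]
)
```

In this toy setup the loss is small when embeddings with the same emotion label sit close together and large when the labels cut across the clusters, which is the behavior a contrastive objective of this kind is meant to encourage.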
Date of Conference: 17-19 October 2024
Date Added to IEEE Xplore: 20 December 2024
Conference Location: Hsinchu City, Taiwan
