Zero-Shot Voice Conversion Based on Speaker Embedding Domain Generalization | IEEE Conference Publication | IEEE Xplore


Abstract:

In this paper, a zero-shot voice conversion framework is constructed by effectively decoupling the semantic and speaker features in speech. The proposed method uses the pre-trained wav2vec 2.0 model to extract semantic features from the source speaker's speech and a WavLM model to extract speaker features from the target speaker's speech. We propose the Robust-MAML model to map the target speaker's features into a domain-generalized space, making them directly applicable to any unregistered (unseen) speaker domain. Finally, through transfer learning, the speech synthesis model FastSpeech2 integrates the semantic features and the domain-generalized speaker features to synthesize speech in the target speaker's voice. Experimental results show that the proposed method outperforms common baseline systems in both naturalness and speaker similarity.
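
The abstract names Robust-MAML but does not describe how it works. As background only, here is a minimal sketch of a generic first-order MAML loop on a toy regression problem, where each "domain" is just a random slope standing in for a speaker domain. It is not the authors' Robust-MAML model: the names `sample_task` and `fomaml`, the scalar model, and all learning rates are assumptions made for illustration.

```python
import random


def loss_grad(w, data):
    """Gradient of the mean squared error for the scalar model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def mse(w, data):
    """Mean squared error of y_hat = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sample_task(rng, n=10):
    """Hypothetical 'domain': a random slope a, with support and query sets from y = a * x."""
    a = rng.uniform(0.5, 1.5)

    def draw():
        return [(x, a * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]

    return draw(), draw()


def fomaml(steps=2000, inner_lr=0.1, outer_lr=0.01, tasks_per_step=4, seed=0):
    """First-order MAML: learn an initial w that adapts well after one inner step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            support, query = sample_task(rng)
            # Inner loop: one gradient step on the task's support set.
            w_adapted = w - inner_lr * loss_grad(w, support)
            # Outer loop (first-order approximation): use the query-set gradient
            # evaluated at the adapted weight, ignoring second derivatives.
            meta_grad += loss_grad(w_adapted, query)
        w -= outer_lr * meta_grad / tasks_per_step
    return w


if __name__ == "__main__":
    w0 = fomaml()
    # Adapt the meta-learned initialization to a new, unseen slope.
    new_task = [(1.0, 2.0), (0.5, 1.0)]
    w1 = w0 - 0.1 * loss_grad(w0, new_task)
    print("meta-learned w:", w0)
    print("loss before/after one adaptation step:", mse(w0, new_task), mse(w1, new_task))
```

The point of the sketch is the two-level structure: a per-domain adaptation step nested inside an update of the shared initialization, so that the initialization transfers to domains not seen during training.
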
Date of Conference: 23-25 December 2023
Date Added to IEEE Xplore: 22 March 2024
Conference Location: Hanoi, Vietnam
