Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications | IEEE Conference Publication | IEEE Xplore

Abstract:

This paper describes a method for distilling a recurrent-based sequence-to-sequence (S2S) voice conversion (VC) model. Although recent VC models achieve increasingly high conversion quality, streaming conversion remains a challenge for practical applications. To achieve streaming VC, the conversion model needs a streamable structure, that is, causal layers rather than non-causal ones. Motivated by this constraint and by recent advances in S2S learning, we apply the teacher-student framework to recurrent-based S2S-VC models. A major challenge is minimizing the degradation caused by causal layers, which mask future input information. Experimental evaluations show that, except for male-to-female speaker conversion, our approach maintains the teacher model's performance in subjective evaluations despite the streamable structure of the student model. Audio samples are available at http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/dists2svc.
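The key constraint in the abstract is that a streaming model may only use causal layers, so a student cannot see the future frames a non-causal teacher relies on. The following is a minimal sketch in plain Python, using scalar frames for readability. It shows the difference between a causal and a non-causal 1-D convolution, shows how a causal layer can run frame by frame with a small buffer, and gives a frame-wise L1 loss that a student could minimize against a teacher's outputs. The names `conv1d`, `StreamingCausalConv`, and `distillation_l1`, the kernel values, and the choice of an L1 loss are illustrative assumptions. The abstract does not specify the layers or the distillation loss the authors used, and their models are recurrent-based rather than convolutional.

```python
from collections import deque
from typing import List, Sequence


def conv1d(x: Sequence[float], kernel: Sequence[float], causal: bool) -> List[float]:
    """Stride-1, same-length 1-D convolution (cross-correlation, as in DL frameworks).

    causal=True pads only on the left, so output[t] depends on x[0..t] alone.
    causal=False pads on both sides, so output[t] also sees future frames.
    """
    k = len(kernel)
    if causal:
        left, right = k - 1, 0
    else:
        left = (k - 1) // 2
        right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


class StreamingCausalConv:
    """Hypothetical streaming wrapper: emits one output per incoming frame,
    keeping only the last k-1 inputs, and matches conv1d(..., causal=True)."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        history = len(self.kernel) - 1
        # Zero-initialized history plays the role of the left padding.
        self.buffer = deque([0.0] * history, maxlen=history)

    def push(self, frame: float) -> float:
        window = list(self.buffer) + [frame]  # oldest ... newest
        y = sum(w * v for w, v in zip(self.kernel, window))
        self.buffer.append(frame)  # oldest frame drops out automatically
        return y


def distillation_l1(student_out: Sequence[float], teacher_out: Sequence[float]) -> float:
    """Assumed frame-wise L1 loss between student and teacher output sequences."""
    if len(student_out) != len(teacher_out):
        raise ValueError("student and teacher sequences must have the same length")
    return sum(abs(s - t) for s, t in zip(student_out, teacher_out)) / len(teacher_out)


if __name__ == "__main__":
    kernel = [0.25, 0.5, 0.25]
    frames = [1.0, 2.0, 3.0, 4.0, 5.0]
    teacher = conv1d(frames, kernel, causal=False)  # uses one future frame
    student = conv1d(frames, kernel, causal=True)   # uses past frames only
    print(teacher)  # [1.0, 2.0, 3.0, 4.0, 3.5]
    print(student)  # [0.25, 1.0, 2.0, 3.0, 4.0]
    stream = StreamingCausalConv(kernel)
    print([stream.push(f) for f in frames] == student)  # True
    print(distillation_l1(student, teacher))  # 0.85
```

Because the causal output at frame t depends only on frames up to t, changing later frames leaves earlier outputs untouched. This is what lets the streaming version emit each frame as soon as it arrives. The non-causal layer offers no such guarantee, and the gap between the two outputs is the kind of information loss a distilled student has to compensate for. In actual training the student's parameters would be updated to reduce a loss like this one, rather than sharing the teacher's kernel as in the toy example.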
Date of Conference: 09-12 January 2023
Date Added to IEEE Xplore: 27 January 2023
Conference Location: Doha, Qatar