Learning Structured Sparse Representations for Voice Conversion

Abstract:

Sparse-coding techniques for voice conversion assume that an utterance can be decomposed into a sparse code that carries only linguistic content and a dictionary of atoms that captures the speakers' characteristics. However, conventional dictionary-construction and sparse-coding algorithms rarely satisfy this assumption. As a result, the sparse code is no longer speaker-independent, which degrades voice-conversion performance. In this paper, we propose a Cluster-Structured Sparse Representation (CSSR) that improves the speaker independence of the representations. CSSR consists of two complementary components: a Cluster-Structured Dictionary Learning module that groups atoms in the dictionary into clusters, and a Cluster-Selective Objective Function that encourages each speech frame to be represented by atoms from a small number of clusters. We conducted four experiments on the CMU ARCTIC corpus to evaluate the proposed method. In the first experiment, an ablation study, results show that each of the two CSSR components enhances speaker independence, and that combining both components leads to further improvements. In the second experiment, we find that CSSR uses increasingly large dictionaries more efficiently than phoneme-based representations by allowing finer-grained decompositions of speech sounds. In the third experiment, results from objective and subjective measurements show that CSSR outperforms prior voice-conversion methods, improving the acoustic quality of the synthesized speech while retaining the target speaker's voice identity. Finally, we show that CSSR captures latent (i.e., phonetic) information in the speech signal.
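
The abstract does not state the Cluster-Selective Objective Function itself, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a standard group-lasso style shrinkage step (group soft-thresholding) as a stand-in for "use atoms from a small number of clusters", and it assumes source and target dictionaries with paired atoms so that a code computed on the source side can be reused with the target dictionary. All function names, cluster labels, and numbers are hypothetical.

```python
import math

def group_soft_threshold(code, clusters, threshold):
    """Shrink each cluster's block of weights by its L2 norm.

    Clusters whose norm falls at or below `threshold` are zeroed entirely,
    so the surviving weights come from only a few clusters.
    (Assumed stand-in for a cluster-selective penalty; not from the paper.)
    """
    out = [0.0] * len(code)
    for cluster_id in set(clusters):
        idx = [i for i, c in enumerate(clusters) if c == cluster_id]
        norm = math.sqrt(sum(code[i] ** 2 for i in idx))
        if norm > threshold:
            scale = 1.0 - threshold / norm
            for i in idx:
                out[i] = scale * code[i]
    return out

def reconstruct(dictionary, code):
    """Build a frame as a weighted sum of atoms (one atom per list entry)."""
    dim = len(dictionary[0])
    return [sum(w * atom[d] for w, atom in zip(code, dictionary))
            for d in range(dim)]

# Hypothetical code for one speech frame: four atoms in two clusters.
code = [3.0, 4.0, 0.3, 0.4]
clusters = [0, 0, 1, 1]

# Cluster 1 has a small norm and is dropped; cluster 0 is only shrunk.
sparse_code = group_soft_threshold(code, clusters, threshold=1.0)

# Hypothetical target-speaker dictionary whose atoms are paired with the
# source atoms; conversion reuses the same code with the target atoms.
target_dict = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
converted_frame = reconstruct(target_dict, sparse_code)

print(sparse_code)
print(converted_frame)
```

In this toy setting only the atoms of cluster 0 remain active after shrinkage, which is the kind of cluster-level selectivity the abstract describes; how CSSR actually learns the clusters and weights the penalty is not recoverable from the abstract.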

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 28)

Page(s): 343 - 354

Date of Publication: 22 November 2019

DOI: 10.1109/TASLP.2019.2955289
