DOI: 10.1145/3573942.3574120
research-article

Voicifier-LN: An Novel Approach to Elevate the Speaker Similarity for General Zero-shot Multi-Speaker TTS

Published: 16 May 2023

Abstract

Speech generated by neural network-based Text-to-Speech (TTS) systems has become increasingly natural and intelligible. However, performance still drops noticeably when synthesizing multi-speaker speech in a zero-shot manner, especially for speakers from different countries with different accents. To bridge this gap, we propose a novel method called Voicifier. It first operates on the high-frequency mel-spectrogram bins to approximately remove content and rhythm. Voicifier then applies two mixing strategies, from shallow to deep, to further destroy content and rhythm while retaining timbre. Furthermore, for better zero-shot performance, we propose Voice-Pin Layer Normalization (VPLN), which pins down the timbre in accordance with the text feature. During inference, the model can synthesize high-quality speech with high speaker similarity from only about 1 second of target speech audio. Experiments and ablation studies show that the methods retain more of the target timbre while discarding much more content- and rhythm-related information. To the best of our knowledge, the methods are universal; that is, they can be applied to most existing TTS systems to enhance cross-speaker synthesis.
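
The abstract does not give implementation details, so the following is only an illustrative sketch of the general ideas it names, not the authors' method. It assumes a mel-spectrogram stored as a list of frames, a hypothetical cutoff index `keep_from` for the high-frequency bins, frame shuffling as one simple stand-in for "mixing", and a generic conditional layer normalization whose gain and bias are supplied from outside (in a real system they would presumably be predicted from a speaker representation). All function names and parameters here are assumptions.

```python
import math
import random


def high_band(mel, keep_from):
    """Keep only mel bins with index >= keep_from in every frame.

    mel is a list of frames; each frame is a list of per-bin energies.
    keep_from is a hypothetical cutoff, not a value taken from the paper.
    """
    return [frame[keep_from:] for frame in mel]


def shuffle_frames(mel, seed=0):
    """Permute frames in time: a simple stand-in for scrambling content and rhythm."""
    frames = list(mel)
    random.Random(seed).shuffle(frames)
    return frames


def bin_stats(mel):
    """Per-bin mean and standard deviation over time (independent of frame order)."""
    n = len(mel)
    columns = list(zip(*mel))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds


def conditional_layer_norm(x, gain, bias, eps=1e-5):
    """Layer-normalize one feature vector, then apply an externally supplied gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(x, gain, bias)]


# Toy input: 3 frames, 4 mel bins each (values are made up).
mel = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
band = high_band(mel, keep_from=2)
scrambled = shuffle_frames(band, seed=1)
```

In this toy setup, the per-bin statistics of `scrambled` match those of `band`, which shows why order-invariant summaries can keep spectral colour while discarding timing. Whether Voicifier relies on such statistics is not stated in the abstract.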

    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN: 9781450396899
    DOI: 10.1145/3573942
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. robust training strategy
    2. speech synthesis
    3. timbre
    4. zero-shot

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 45
    • Downloads (Last 12 months): 19
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 20 Feb 2025
