
Cross-lingual voice conversion based on F0 multi-scale modeling with VITS

Published: 24 July 2024 · DOI: 10.1145/3672919.3672988

Abstract

This paper introduces a cross-lingual voice conversion method that uses an F0 predictor for multi-scale modeling of the fundamental frequency (F0). Built on the VITS architecture, the method achieves high-quality voice generation in an end-to-end manner. In cross-lingual conversion, the converted voice often carries an unnatural foreign accent because the source and target voices are in different languages. To address this issue, Whisper is introduced as a content extractor to capture the detailed content of the speech, including the specific accent of the source voice, which is crucial for effective cross-lingual conversion. In addition, the F0 predictor performs multi-scale modeling of F0 to predict the fundamental frequency contour, which helps retain the accent characteristics of the source voice during conversion. Trained solely on the English VCTK dataset, the proposed method achieves cross-lingual conversion across various languages and significantly reduces the impact of foreign accents. To further explore the contribution of the F0 predictor, a series of ablation experiments was designed, and both objective and subjective evaluations demonstrate the effectiveness of the proposed method.
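
To make the abstract's pipeline concrete, the sketch below illustrates one plausible reading of "multi-scale modeling of F0": the frame-level F0 contour is average-pooled at several temporal scales, each scale is embedded by a small convolution, and the embeddings are upsampled back to the frame rate and summed into a conditioning signal for the VITS decoder. This is a minimal illustration, not the authors' implementation; the module name MultiScaleF0Encoder, the scale set (1, 4, 16), the hidden size 192, and the use of a log-F0 input are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleF0Encoder(nn.Module):
    """Hypothetical multi-scale F0 conditioning module (not from the paper)."""

    def __init__(self, hidden: int = 192, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        # one 1-D conv embedding per temporal scale
        self.embed = nn.ModuleList(
            nn.Conv1d(1, hidden, kernel_size=3, padding=1) for _ in scales
        )

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        # f0: (batch, frames) log-F0 contour from any pitch tracker
        x = f0.unsqueeze(1)          # (batch, 1, frames)
        frames = x.shape[-1]
        out = 0.0
        for scale, conv in zip(self.scales, self.embed):
            # coarser scales summarize pitch over longer windows,
            # capturing phrase-level intonation rather than local jitter
            pooled = F.avg_pool1d(x, scale, stride=scale) if scale > 1 else x
            h = conv(pooled)         # (batch, hidden, frames // scale)
            # upsample every scale back to frame rate so they align
            out = out + F.interpolate(h, size=frames, mode="linear",
                                      align_corners=False)
        return out                   # (batch, hidden, frames)

# usage: a fake 400-frame log-F0 contour for a batch of two utterances
enc = MultiScaleF0Encoder()
cond = enc(torch.randn(2, 400))
print(cond.shape)                    # torch.Size([2, 192, 400])
```

Summing the upsampled scales gives the decoder simultaneous access to local pitch movement (scale 1) and utterance-level intonation (scale 16), which is the property the abstract credits with preserving the source accent; the paper's actual predictor architecture may differ.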



Published In

CSAIDE '24: Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, March 2024, 676 pages. ISBN 9798400718212. DOI: 10.1145/3672919.

Publisher

Association for Computing Machinery, New York, NY, United States
