Cross-lingual voice conversion based on F0 multi-scale modeling with VITS

Authors:

Zeyi Zhang

CSAIDE '24: Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy

Pages 375–379

https://doi.org/10.1145/3672919.3672988

Published: 24 July 2024

Abstract

This paper introduces a cross-lingual voice conversion method that uses an F0 predictor to model the fundamental frequency (F0) at multiple scales. Built on the VITS architecture, the method generates high-quality speech end to end. In cross-lingual conversion, the source and target voices are in different languages, so the converted voice often carries an unnatural foreign accent. To address this issue, Whisper is used as a content extractor to capture the detailed content of the speech, including the specific accent of the source voice, which is crucial for effective cross-lingual conversion. In addition, the F0 predictor models F0 at multiple scales to predict the fundamental frequency contour, which helps retain the accent characteristics of the source voice during conversion. Although trained solely on the English VCTK dataset, the model achieves cross-lingual conversion across various languages and significantly reduces the impact of foreign accents. A series of ablation experiments was designed to examine the contribution of the F0 predictor, and objective and subjective evaluations demonstrate the effectiveness of the proposed method.
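The abstract does not say how the F0 contour is represented at multiple scales. As a minimal sketch of one plausible reading, and not the paper's method, the Python below smooths an interpolated log-F0 contour with moving averages of several widths, so that narrow windows keep frame-level detail and wide windows keep the phrase-level trend. The function name `multi_scale_f0`, the window sizes, and the convention that 0 Hz marks an unvoiced frame are all assumptions made for illustration.

```python
import numpy as np


def multi_scale_f0(f0_hz, window_sizes=(1, 5, 25)):
    """Smooth a log-F0 contour at several time scales (hypothetical helper).

    f0_hz: per-frame F0 in Hz; 0 marks an unvoiced frame (assumed convention).
    window_sizes: odd moving-average widths in frames (assumed values).
    Returns an array of shape (len(window_sizes), n_frames).
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0_hz > 0
    if not voiced.any():
        # No voiced frames: nothing to model, return zeros of the right shape.
        return np.zeros((len(window_sizes), f0_hz.size))

    # Interpolate log-F0 across unvoiced gaps so smoothing sees a continuous curve.
    frames = np.arange(f0_hz.size)
    log_f0 = np.interp(frames, frames[voiced], np.log(f0_hz[voiced]))

    scales = []
    for width in window_sizes:
        if width % 2 == 0:
            raise ValueError("window sizes must be odd")
        # Edge padding keeps the output length equal to the input length.
        padded = np.pad(log_f0, width // 2, mode="edge")
        kernel = np.ones(width) / width
        scales.append(np.convolve(padded, kernel, mode="valid"))
    return np.stack(scales)


# Usage: eight frames, three scales.
contours = multi_scale_f0([0, 110, 112, 0, 0, 180, 176, 0])
print(contours.shape)  # (3, 8)
```

In a conversion model, contours like these could serve as regression targets or conditioning features for an F0 predictor; how the paper actually combines scales is not stated in the abstract.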

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

CSAIDE '24: Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy

March 2024

676 pages

ISBN: 9798400718212

DOI: 10.1145/3672919

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CSAIDE 2024

CSAIDE 2024: 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy

March 1 - 3, 2024

Nanjing, China
