DOI: 10.1145/3581783.3613800
Research article, MM '23 conference proceedings

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Published: 27 October 2023

Abstract

Voice conversion (VC), the style-transfer task applied to speech, converts one person's speech into new speech that sounds as if it were spoken by another person. A considerable body of research has been devoted to improving VC. However, a good voice conversion model should match not only the timbre of the target speaker but also expressive attributes such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is as challenging as it is important, especially in the absence of text transcriptions. In this paper, we propose a novel voice conversion framework named PMVC, which effectively separates and models content, timbre, and prosodic information from speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction, and building upon it, a mask-and-predict mechanism is applied to disentangle prosody from content information. Experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.
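The abstract mentions a speech augmentation algorithm for robust prosody extraction but does not spell it out here. As a minimal sketch of the general idea, not the paper's actual algorithm: a common way to perturb prosody while leaving linguistic content largely intact is segment-wise random time-stretching of a prosodic contour (e.g., an F0 track). The function name `random_prosody_augment`, the segment length, and the stretch range below are all hypothetical choices for illustration:

```python
import random

def random_prosody_augment(contour, seg_len=20, stretch_range=(0.7, 1.3), seed=None):
    """Segment-wise random time-stretching of a 1-D prosody contour.

    The contour is split into fixed-length segments, and each segment is
    resampled at a random rate via linear interpolation. This perturbs
    rhythm and local prosody while preserving the overall contour shape.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(contour), seg_len):
        seg = contour[start:start + seg_len]
        rate = rng.uniform(*stretch_range)        # random stretch factor
        new_len = max(1, round(len(seg) * rate))  # resampled segment length
        for i in range(new_len):
            # map the new index linearly back into the original segment
            pos = i * (len(seg) - 1) / max(1, new_len - 1)
            lo = int(pos)
            hi = min(lo + 1, len(seg) - 1)
            frac = pos - lo
            out.append(seg[lo] * (1 - frac) + seg[hi] * frac)
    return out
```

Training a prosody encoder on such randomly stretched contours would discourage it from memorizing exact durations, which is one plausible route to the "robust prosody extraction" the abstract refers to.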




        Published In

        MM '23: Proceedings of the 31st ACM International Conference on Multimedia
        October 2023
        9913 pages
        ISBN:9798400701085
        DOI:10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. contrastive learning
        2. random prosody algorithm
        3. speech synthesis
        4. voice conversion

        Qualifiers

        • Research-article

        Funding Sources

        • the Key Research and Development Program of Guangdong Province

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

        Acceptance Rates

        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

• Downloads (last 12 months): 149
• Downloads (last 6 weeks): 16

Reflects downloads up to 19 Feb 2025

Cited By

• (2024) EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning. 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-7. DOI: 10.1109/IJCNN60899.2024.10651156. Online publication date: 30-Jun-2024.
• (2024) Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation. 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-7. DOI: 10.1109/IJCNN60899.2024.10650293. Online publication date: 30-Jun-2024.
• (2024) Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7150-7154. DOI: 10.1109/ICASSP48485.2024.10447283. Online publication date: 14-Apr-2024.
• (2023) CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation. 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp. 1143-1148. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00182. Online publication date: 21-Dec-2023.
• (2023) Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 913-917. DOI: 10.1109/ICTAI59109.2023.00137. Online publication date: 6-Nov-2023.
• (2023) AOSR-Net: All-in-One Sandstorm Removal Network. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 641-645. DOI: 10.1109/ICTAI59109.2023.00100. Online publication date: 6-Nov-2023.
