DOI: 10.1145/3581783.3612150

UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching

Published: 27 October 2023

Abstract

Although previous work has achieved remarkable results in singing voice generation, most existing models focus on a single application, and unified singing voice synthesis models are lacking. Beyond the low relevance among tasks, the differing input modalities are one of the most intractable obstacles: current methods suffer from information confusion and cannot perform precise control. In this work, we propose UniSinger, a unified end-to-end singing voice synthesizer that integrates three abilities related to singing voice generation into a single framework: singing voice synthesis (SVS), singing voice conversion (SVC), and singing voice editing (SVE). Specifically, we perform representation disentanglement to control different attributes of the singing voice. We further propose a cross-modality information matching method that closes the distribution gap between multi-modal inputs and enables end-to-end training. Experiments conducted on the OpenSinger dataset demonstrate that UniSinger achieves state-of-the-art results in all three applications. Further extensive experiments verify the capability of representation disentanglement and information matching, showing that UniSinger offers clear advantages in sample quality, timbre similarity, and multi-task compatibility. Audio samples can be found at https://unisinger.github.io/Samples/.
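The abstract's two central ideas, representation disentanglement and cross-modality information matching, can be made concrete with a short sketch. The PyTorch code below is a hypothetical illustration, not the paper's actual architecture: every module name, dimension, and loss choice is an assumption. It shows the general pattern of encoding content, timbre, and pitch separately, then pulling a text-derived content representation toward its audio-derived counterpart so that either modality can drive a shared decoder.

```python
# Hypothetical sketch of disentanglement + cross-modality matching.
# All shapes, modules, and losses are illustrative assumptions; they do
# not reproduce UniSinger's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEncoder(nn.Module):
    """Maps a frame-level input sequence to a hidden attribute sequence."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, x):            # x: (batch, time, in_dim)
        return self.net(x)           # -> (batch, time, hidden)

class UnifiedSynthesizer(nn.Module):
    def __init__(self, mel_dim=80, phone_dim=64, hidden=256):
        super().__init__()
        self.content_from_audio = AttributeEncoder(mel_dim, hidden)
        self.content_from_text = AttributeEncoder(phone_dim, hidden)
        self.timbre_enc = AttributeEncoder(mel_dim, hidden)
        self.pitch_enc = AttributeEncoder(1, hidden)
        self.decoder = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, mel_dim))

    def forward(self, mel, phones, f0):
        # Disentangled attribute representations.
        z_content_audio = self.content_from_audio(mel)         # frame-level content
        z_content_text = self.content_from_text(phones)        # same content, other modality
        z_timbre = self.timbre_enc(mel).mean(1, keepdim=True)  # utterance-level timbre
        z_pitch = self.pitch_enc(f0)                           # frame-level pitch

        # Cross-modality information matching: pull the text-derived content
        # toward the audio-derived content so either path can feed the decoder.
        match_loss = F.mse_loss(z_content_text, z_content_audio.detach())

        # Recombine the disentangled attributes and reconstruct the mel-spectrogram.
        z = torch.cat([z_content_audio,
                       z_timbre.expand_as(z_content_audio),
                       z_pitch], dim=-1)
        recon_loss = F.mse_loss(self.decoder(z), mel)
        return recon_loss + match_loss

# Toy usage with frame-aligned inputs (phonemes already expanded by duration).
mel = torch.randn(2, 100, 80)
phones = torch.randn(2, 100, 64)
f0 = torch.randn(2, 100, 1)
UnifiedSynthesizer()(mel, phones, f0).backward()
```

In practice the matching term would likely be a distribution-level objective (adversarial or mutual-information based, for example) rather than a plain MSE, and tasks such as SVC or SVE would recombine attributes drawn from different utterances; this toy version only conveys the overall structure.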


Cited By

  • (2024) RPA-SCD: Rhythm and Pitch Aware Dual-Branch Network for Songs Conversion Detection. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. https://doi.org/10.1109/IJCNN60899.2024.10651233. Online publication date: 30 June 2024.


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 October 2023


      Author Tags

      1. singing voice conversion
      2. singing voice editing
      3. singing voice synthesis
      4. unified end-to-end model

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • National Key R&D Program of China

      Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 – November 3, 2023
Ottawa, ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

      Article Metrics

• Downloads (Last 12 months): 174
• Downloads (Last 6 weeks): 7
      Reflects downloads up to 12 Feb 2025

