DOI: 10.1145/3581783.3612150

UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching

Published: 27 October 2023

Abstract

Although previous work has achieved remarkable results in singing voice generation, most existing models focus on a single application, and unified singing voice synthesis models are lacking. Beyond the low relevance among tasks, the differing input modalities are one of the most intractable obstacles: current methods suffer from information confusion and cannot perform precise control. In this work, we propose UniSinger, a unified end-to-end singing voice synthesizer that integrates three abilities related to singing voice generation into a single framework: singing voice synthesis (SVS), singing voice conversion (SVC), and singing voice editing (SVE). Specifically, we perform representation disentanglement to control different attributes of the singing voice. We further propose a cross-modality information matching method that closes the distribution gap between multi-modal inputs and enables end-to-end training. Experiments conducted on the OpenSinger dataset demonstrate that UniSinger achieves state-of-the-art results in all three applications. Further extensive experiments verify the capability of representation disentanglement and information matching, showing that UniSinger offers clear advantages in sample quality, timbre similarity, and multi-task compatibility. Audio samples can be found at https://unisinger.github.io/Samples/.
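The abstract's two central ideas, representation disentanglement and cross-modality information matching, can be made concrete with a short sketch. The PyTorch code below is a hypothetical illustration, not the paper's actual architecture: every module name, dimension, and loss choice is an assumption. It shows the general pattern of encoding content, timbre, and pitch separately, then pulling a text-derived content representation toward its audio-derived counterpart so that either modality can drive a shared decoder.

```python
# Hypothetical sketch of disentanglement + cross-modality matching.
# All shapes, modules, and losses are illustrative assumptions; they do
# not reproduce UniSinger's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEncoder(nn.Module):
    """Maps a frame-level input sequence to a hidden attribute sequence."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, x):            # x: (batch, time, in_dim)
        return self.net(x)           # -> (batch, time, hidden)

class UnifiedSynthesizer(nn.Module):
    def __init__(self, mel_dim=80, phone_dim=64, hidden=256):
        super().__init__()
        self.content_from_audio = AttributeEncoder(mel_dim, hidden)
        self.content_from_text = AttributeEncoder(phone_dim, hidden)
        self.timbre_enc = AttributeEncoder(mel_dim, hidden)
        self.pitch_enc = AttributeEncoder(1, hidden)
        self.decoder = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, mel_dim))

    def forward(self, mel, phones, f0):
        # Disentangled attribute representations.
        z_content_audio = self.content_from_audio(mel)         # frame-level content
        z_content_text = self.content_from_text(phones)        # same content, other modality
        z_timbre = self.timbre_enc(mel).mean(1, keepdim=True)  # utterance-level timbre
        z_pitch = self.pitch_enc(f0)                           # frame-level pitch

        # Cross-modality information matching: pull the text-derived content
        # toward the audio-derived content so either path can feed the decoder.
        match_loss = F.mse_loss(z_content_text, z_content_audio.detach())

        # Recombine the disentangled attributes and reconstruct the mel-spectrogram.
        z = torch.cat([z_content_audio,
                       z_timbre.expand_as(z_content_audio),
                       z_pitch], dim=-1)
        recon_loss = F.mse_loss(self.decoder(z), mel)
        return recon_loss + match_loss

# Toy usage with frame-aligned inputs (phonemes already expanded by duration).
mel = torch.randn(2, 100, 80)
phones = torch.randn(2, 100, 64)
f0 = torch.randn(2, 100, 1)
UnifiedSynthesizer()(mel, phones, f0).backward()
```

In practice the matching term would likely be a distribution-level objective (adversarial or mutual-information based, for example) rather than a plain MSE, and tasks such as SVC or SVE would recombine attributes drawn from different utterances; this toy version only conveys the overall structure.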


Cited By

  • (2024) RPA-SCD: Rhythm and Pitch Aware Dual-Branch Network for Songs Conversion Detection. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. https://doi.org/10.1109/IJCNN60899.2024.10651233. Online publication date: 30 June 2024.


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 October 2023


      Author Tags

      1. singing voice conversion
      2. singing voice editing
      3. singing voice synthesis
      4. unified end-to-end model

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • National Key R&D Program of China

      Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 – November 3, 2023
Ottawa, ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

      Article Metrics

• Downloads (Last 12 months): 174
• Downloads (Last 6 weeks): 7
      Reflects downloads up to 12 Feb 2025

