DOI: 10.1145/3594806.3596536

Practical Study of Deep Learning Models for Speech Synthesis

Published: 10 August 2023

Abstract

Speech synthesis systems, also known as Text-To-Speech (TTS) systems, are increasingly common nowadays, with applications such as voice assistants and screen readers for visually impaired or blind people. These applications require strong real-time capabilities to be usable in practice, which can come at the cost of reduced quality in the synthesized voices. Deep Learning models, which have shown impressive results in audio generation, are rarely used for everyday TTS because of their high computational demands. Training such models also requires a large amount of good-quality data, which is not available for most languages. This paper explores the benefits of cross-lingual transfer learning, both in terms of training time and of the amount of data needed to obtain good-quality models. Our contributions are evaluated against other TTS systems available for the French language. The main observation is that good-quality single-speaker models can be trained within half a week on a single GPU, with a limited amount of good-quality data, by combining transfer learning with few-shot learning.
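To make the recipe described in the abstract concrete, below is a minimal PyTorch sketch of the general transfer-learning setup: load weights pretrained on a high-resource language, freeze the layers expected to transfer, and fine-tune the rest on a small target-language corpus. The stand-in model, checkpoint path, and random data are hypothetical illustrations, not the system or code used in the paper.

```python
# Minimal sketch of cross-lingual transfer learning for TTS, under the
# assumptions stated above. All names and paths below are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyAcousticModel(nn.Module):
    """Stand-in for a sequence-to-sequence acoustic model (e.g. Tacotron 2-like)."""
    def __init__(self, vocab_size=64, hidden=128, n_mels=80):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)   # text-facing layers
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)          # predicts mel frames

    def forward(self, tokens):
        h = self.encoder(tokens)
        out, _ = self.decoder(h)
        return self.mel_head(out)

model = TinyAcousticModel()

# 1) Transfer learning: load weights pretrained on a high-resource language.
#    strict=False tolerates layers that differ (e.g. a new grapheme table).
# state = torch.load("pretrained_english.pt")   # hypothetical checkpoint
# model.load_state_dict(state, strict=False)

# 2) Freeze the layers assumed to transfer across languages and only adapt
#    the rest, which keeps fine-tuning cheap on a single GPU.
for p in model.decoder.parameters():
    p.requires_grad = False

# 3) Few-shot fine-tuning on a small target-language dataset; random tensors
#    stand in for (token sequence, mel-spectrogram) pairs here.
tokens = torch.randint(0, 64, (32, 50))
mels = torch.randn(32, 50, 80)
loader = DataLoader(TensorDataset(tokens, mels), batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```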


Cited By

  • (2024) "Embeddings for Motor Imagery Classification." 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. DOI: 10.1109/MLSP58920.2024.10734730. Online publication date: 22 September 2024.
  • (2024) "GPGAN-VC: Enhancing Voice Conversion using Gradient Penalty." 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1–6. DOI: 10.1109/APSIPAASC63619.2025.10848670. Online publication date: 3 December 2024.
  • (2023) "Audio Reading Assistant for Visually Impaired People." Advances in Cyber-Physical Systems 8(2), pp. 81–88. DOI: 10.23939/acps2023.02.081. Online publication date: 10 November 2023.

Published In

PETRA '23: Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments
July 2023
797 pages
ISBN:9798400700699
DOI:10.1145/3594806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 August 2023


Author Tags

  1. Deep Learning
  2. Speech Synthesis
  3. Text-to-Speech
  4. Transfer Learning
  5. Voice Cloning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PETRA '23


Article Metrics

  • Downloads (last 12 months): 32
  • Downloads (last 6 weeks): 7
Reflects downloads up to 02 Mar 2025
