DOI: 10.1145/3594806.3596536

Practical Study of Deep Learning Models for Speech Synthesis

Published: 10 August 2023

Abstract

Speech synthesis systems, also known as Text-To-Speech (TTS) systems, are increasingly common nowadays, with applications such as voice assistants and screen readers for visually impaired or blind people. These applications require strong real-time capabilities to be usable in practice, which can come at the cost of reduced quality in the synthesized voices. Deep Learning models, which have shown impressive results in audio generation, are rarely used for everyday TTS because of their high computational demands. Training such models also requires a large amount of good-quality data, which is not available for most languages. This paper explores the benefits of cross-lingual transfer learning, both in terms of training time and of the amount of data needed to obtain good-quality models. Our contributions are evaluated against other TTS systems available for the French language. The main observation is that good-quality single-speaker models can be trained within half a week on a single GPU, with a limited amount of good-quality data, by combining transfer learning with few-shot learning.
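To make the recipe described in the abstract concrete, below is a minimal PyTorch sketch of the general transfer-learning setup: load weights pretrained on a high-resource language, freeze the layers expected to transfer, and fine-tune the rest on a small target-language corpus. The stand-in model, checkpoint path, and random data are hypothetical illustrations, not the system or code used in the paper.

```python
# Minimal sketch of cross-lingual transfer learning for TTS, under the
# assumptions stated above. All names and paths below are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyAcousticModel(nn.Module):
    """Stand-in for a sequence-to-sequence acoustic model (e.g. Tacotron 2-like)."""
    def __init__(self, vocab_size=64, hidden=128, n_mels=80):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)   # text-facing layers
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)          # predicts mel frames

    def forward(self, tokens):
        h = self.encoder(tokens)
        out, _ = self.decoder(h)
        return self.mel_head(out)

model = TinyAcousticModel()

# 1) Transfer learning: load weights pretrained on a high-resource language.
#    strict=False tolerates layers that differ (e.g. a new grapheme table).
# state = torch.load("pretrained_english.pt")   # hypothetical checkpoint
# model.load_state_dict(state, strict=False)

# 2) Freeze the layers assumed to transfer across languages and only adapt
#    the rest, which keeps fine-tuning cheap on a single GPU.
for p in model.decoder.parameters():
    p.requires_grad = False

# 3) Few-shot fine-tuning on a small target-language dataset; random tensors
#    stand in for (token sequence, mel-spectrogram) pairs here.
tokens = torch.randint(0, 64, (32, 50))
mels = torch.randn(32, 50, 80)
loader = DataLoader(TensorDataset(tokens, mels), batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```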


Cited By

  • (2024) "Embeddings for Motor Imagery Classification." 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. DOI: 10.1109/MLSP58920.2024.10734730. Online publication date: 22 September 2024.
  • (2024) "GPGAN-VC: Enhancing Voice Conversion using Gradient Penalty." 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1–6. DOI: 10.1109/APSIPAASC63619.2025.10848670. Online publication date: 3 December 2024.
  • (2023) "Audio Reading Assistant for Visually Impaired People." Advances in Cyber-Physical Systems 8(2), pp. 81–88. DOI: 10.23939/acps2023.02.081. Online publication date: 10 November 2023.

Published In

PETRA '23: Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments
July 2023
797 pages
ISBN:9798400700699
DOI:10.1145/3594806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 August 2023


Author Tags

  1. Deep Learning
  2. Speech Synthesis
  3. Text-to-Speech
  4. Transfer Learning
  5. Voice Cloning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PETRA '23


Article Metrics

  • Downloads (last 12 months): 32
  • Downloads (last 6 weeks): 7
Reflects downloads up to 02 Mar 2025
