
CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

Conference paper
Text, Speech, and Dialogue (TSD 2023)

Abstract

In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models; it consists of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide YourTTS, a multilingual TTS model trained on 3,176.13 h of speech from CML-TTS together with 245.07 h of English speech from LibriTTS. Our purpose in creating this dataset is to open up new research possibilities for multilingual TTS models. The dataset is publicly available under the CC-BY 4.0 license (https://freds0.github.io/CML-TTS-Dataset).
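To give a concrete feel for how such a corpus might be consumed, the sketch below walks one split and totals its audio duration. The directory layout, the transcripts.txt file name, the tab-separated file-ID/text format, and the speaker/book ID convention are assumptions carried over from MLS for illustration, not documented properties of CML-TTS; the dataset page linked above is authoritative.

```python
# Minimal sketch: iterating one split of an MLS-style speech corpus.
# ASSUMPTIONS (borrowed from MLS, not guaranteed for CML-TTS):
#   - a transcripts.txt file with tab-separated "<file_id>\t<text>" lines
#   - audio stored under audio/<speaker>/<book>/<file_id>.flac
#   - file IDs shaped like "<speaker>_<book>_<segment>"
from pathlib import Path

import soundfile as sf  # pip install soundfile

split_root = Path("CML-TTS/pt/train")  # hypothetical path

def iter_pairs(root: Path):
    """Yield (audio_path, transcript) pairs for one split."""
    with open(root / "transcripts.txt", encoding="utf-8") as f:
        for line in f:
            file_id, text = line.rstrip("\n").split("\t", 1)
            speaker, book, _segment = file_id.split("_", 2)
            yield root / "audio" / speaker / book / f"{file_id}.flac", text

total_seconds = 0.0
for wav_path, text in iter_pairs(split_root):
    total_seconds += sf.info(str(wav_path)).duration  # reads header only

print(f"{total_seconds / 3600.0:.2f} h of audio in this split")
```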


Notes

  1. https://librivox.org/.

  2. https://www.gutenberg.org/.

  3. https://www.readbeyond.it/aeneas/docs/index.html (the aeneas forced aligner; a usage sketch follows this list).

  4. https://www.instituto-camoes.pt/en/activity-camoes/what-we-do/teach-portuguese/orthographic-agreement.
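Footnote 3 points to aeneas, a forced-alignment tool for synchronizing audiobook text with its audio. For orientation only, here is a minimal sketch following aeneas's documented task API; the paths are placeholders, and the exact alignment configuration used for CML-TTS is not specified in this excerpt.

```python
# Minimal forced-alignment sketch using aeneas (docs: footnote 3).
# Paths are placeholders; "por" (Portuguese) is an arbitrary example
# language -- nothing here reproduces the authors' actual pipeline.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One alignment task: input language, text format, and output format.
config_string = "task_language=por|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/path/to/chapter.mp3"
task.text_file_path_absolute = "/path/to/chapter.txt"
task.sync_map_file_path_absolute = "/path/to/syncmap.json"

# Run the aligner and write the text-to-audio sync map to disk.
ExecuteTask(task).execute()
task.output_sync_map_file()
```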


Acknowledgements

The authors are grateful to CEIA at UFG for their support and to Coqui and CyberLabs for their valuable assistance. We also thank the LibriVox volunteers for making this project possible.

Author information

Correspondence to Frederico S. Oliveira, Edresson Casanova, Arnaldo Candido Junior, Anderson S. Soares or Arlindo R. Galvão Filho.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Oliveira, F.S., Casanova, E., Junior, A.C., Soares, A.S., Galvão Filho, A.R. (2023). CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_17


  • DOI: https://doi.org/10.1007/978-3-031-40498-6_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
