Abstract
In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goiás (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models; it consists of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide a multilingual YourTTS model trained on 3,176.13 hours of speech from CML-TTS together with 245.07 hours of English speech from LibriTTS. Our purpose in creating this dataset is to open new research possibilities for multilingual TTS models. The dataset is publicly available under the CC-BY 4.0 license (https://freds0.github.io/CML-TTS-Dataset).
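Since the abstract does not include a usage example, the following is a minimal, hypothetical sketch of how one might iterate over a downloaded CML-TTS split. Because CML-TTS is derived from MLS, the sketch assumes an MLS-style layout: a `transcripts.txt` file with tab-separated utterance IDs and text, IDs of the form `speaker_book_index`, and audio stored under `audio/<speaker>/<book>/`. These file names and ID conventions are assumptions, not details confirmed by the paper, and should be checked against the released dataset.

```python
# Hypothetical sketch: iterate over one language split of CML-TTS,
# assuming the MLS-style layout it is derived from -- a transcripts.txt
# with tab-separated "utterance_id<TAB>text" lines and audio stored at
# audio/<speaker>/<book>/<utterance_id>.flac. Verify against the release.
from pathlib import Path

def load_split(split_dir: str):
    """Yield (audio_path, transcript) pairs for one split (e.g. 'train')."""
    root = Path(split_dir)
    with open(root / "transcripts.txt", encoding="utf-8") as f:
        for line in f:
            utt_id, text = line.rstrip("\n").split("\t", maxsplit=1)
            # Assumed MLS-style utterance IDs: <speaker>_<book>_<index>
            speaker, book, _ = utt_id.split("_", maxsplit=2)
            audio = root / "audio" / speaker / book / f"{utt_id}.flac"
            if audio.exists():
                yield audio, text

if __name__ == "__main__":
    # Directory name is illustrative only.
    for path, text in load_split("cml_tts_dataset_dutch/train"):
        print(path, "->", text[:60])
        break
```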
Acknowledgements
The authors are grateful to CEIA at UFG for their support and to Coqui and CyberLabs for their valuable assistance. We also thank the LibriVox volunteers for making this project possible.