
CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

Conference paper
Text, Speech, and Dialogue (TSD 2023)

Abstract

In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models; it consists of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide YourTTS, a multilingual TTS model trained on 3,176.13 h of speech from CML-TTS together with 245.07 h of English speech from LibriTTS. Our purpose in creating this dataset is to open up new research possibilities for multilingual TTS models. The dataset is publicly available under the CC-BY 4.0 license (https://freds0.github.io/CML-TTS-Dataset).
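To give a concrete feel for how such a corpus might be consumed, the sketch below walks one split and totals its audio duration. The directory layout, the transcripts.txt file name, the tab-separated file-ID/text format, and the speaker/book ID convention are assumptions carried over from MLS for illustration, not documented properties of CML-TTS; the dataset page linked above is authoritative.

```python
# Minimal sketch: iterating one split of an MLS-style speech corpus.
# ASSUMPTIONS (borrowed from MLS, not guaranteed for CML-TTS):
#   - a transcripts.txt file with tab-separated "<file_id>\t<text>" lines
#   - audio stored under audio/<speaker>/<book>/<file_id>.flac
#   - file IDs shaped like "<speaker>_<book>_<segment>"
from pathlib import Path

import soundfile as sf  # pip install soundfile

split_root = Path("CML-TTS/pt/train")  # hypothetical path

def iter_pairs(root: Path):
    """Yield (audio_path, transcript) pairs for one split."""
    with open(root / "transcripts.txt", encoding="utf-8") as f:
        for line in f:
            file_id, text = line.rstrip("\n").split("\t", 1)
            speaker, book, _segment = file_id.split("_", 2)
            yield root / "audio" / speaker / book / f"{file_id}.flac", text

total_seconds = 0.0
for wav_path, text in iter_pairs(split_root):
    total_seconds += sf.info(str(wav_path)).duration  # reads header only

print(f"{total_seconds / 3600.0:.2f} h of audio in this split")
```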


Notes

  1. https://librivox.org/.

  2. https://www.gutenberg.org/.

  3. https://www.readbeyond.it/aeneas/docs/index.html (the aeneas forced aligner; a usage sketch follows this list).

  4. https://www.instituto-camoes.pt/en/activity-camoes/what-we-do/teach-portuguese/orthographic-agreement.
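Footnote 3 points to aeneas, a forced-alignment tool for synchronizing audiobook text with its audio. For orientation only, here is a minimal sketch following aeneas's documented task API; the paths are placeholders, and the exact alignment configuration used for CML-TTS is not specified in this excerpt.

```python
# Minimal forced-alignment sketch using aeneas (docs: footnote 3).
# Paths are placeholders; "por" (Portuguese) is an arbitrary example
# language -- nothing here reproduces the authors' actual pipeline.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One alignment task: input language, text format, and output format.
config_string = "task_language=por|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/path/to/chapter.mp3"
task.text_file_path_absolute = "/path/to/chapter.txt"
task.sync_map_file_path_absolute = "/path/to/syncmap.json"

# Run the aligner and write the text-to-audio sync map to disk.
ExecuteTask(task).execute()
task.output_sync_map_file()
```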


Acknowledgements

The authors are grateful to CEIA at UFG for their support and to Coqui and CyberLabs for their valuable assistance. We also thank the LibriVox volunteers for making this project possible.

Author information

Correspondence to Frederico S. Oliveira, Edresson Casanova, Arnaldo Candido Junior, Anderson S. Soares or Arlindo R. Galvão Filho.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Oliveira, F.S., Casanova, E., Junior, A.C., Soares, A.S., Galvão Filho, A.R. (2023). CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_17


  • DOI: https://doi.org/10.1007/978-3-031-40498-6_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
