Abstract
With the advent of deep learning, text-to-speech (TTS) technology has been revolutionized, and current state-of-the-art models can synthesize almost human-like speech. Recent TTS models use a sequence-to-sequence architecture that directly converts a text or phoneme sequence into a low-level acoustic representation such as a spectrogram. These end-to-end models need large datasets for training, and with conventional training methodology they require days of training to generate intelligible and natural speech. The question of how to use a large dataset to train a TTS model efficiently has not been studied in the past. Curriculum learning has been shown to speed up the convergence of models in other machine learning areas. For the TTS task, the challenge in creating a curriculum is to establish a difficulty criterion for the training samples. In this paper, we experiment with various scoring functions based on text and acoustic features and achieve faster convergence of an end-to-end TTS model. We find text length, i.e., the number of phonemes or characters in the text, to be a simple yet highly effective measure of difficulty for designing a curriculum for the TTS task. Using a text-length-based curriculum, we validate the faster convergence of the TTS model on three datasets of different languages.
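The text-length curriculum described in the abstract can be illustrated with a minimal sketch (this is an illustration of the general idea, not the authors' code; the `build_curriculum` function and the sample format are assumptions):

```python
def build_curriculum(samples, num_stages=4):
    """Order training samples by text length (a proxy for difficulty)
    and split them into stages, easiest first."""
    # Rank samples from shortest to longest text.
    ranked = sorted(samples, key=lambda s: len(s["text"]))
    # Ceiling division so every sample lands in some stage.
    stage_size = -(-len(ranked) // num_stages)
    return [ranked[i * stage_size:(i + 1) * stage_size]
            for i in range(num_stages)]

# Toy usage: three utterances split into three curriculum stages.
samples = [
    {"text": "hello world"},
    {"text": "hi"},
    {"text": "a much longer sentence here"},
]
stages = build_curriculum(samples, num_stages=3)
# Training would then proceed stage by stage, from short to long texts.
```

In an actual training loop, earlier stages (short utterances) would be presented first, with longer utterances introduced gradually, which is the scheduling idea the paper evaluates.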
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Kaur, N., Ghosh, P.K. (2023). Curriculum Learning Based Approach for Faster Convergence of TTS Model. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48311-0
Online ISBN: 978-3-031-48312-7
eBook Packages: Computer Science (R0)