
How Much End-to-End is Tacotron 2 End-to-End TTS System

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)


Abstract

In recent years, the concept of end-to-end text-to-speech synthesis has begun to attract the attention of researchers. The motivation is simple: replacing the individual modules that TTS systems are traditionally built on with a single powerful deep neural network simplifies the architecture of the entire system. However, how capable are such end-to-end systems of dealing with classic tasks such as G2P conversion, text normalisation, homograph disambiguation and other issues inseparably linked to text-to-speech systems?

In the present paper, we explore three freely available implementations of Tacotron 2-based speech synthesizers, focusing on their ability to transform input text into correct pronunciation, not only in terms of G2P conversion but also in handling issues related to text analysis and the prosody patterns used.
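To illustrate the kind of front-end task the abstract refers to, the following is a minimal, hypothetical sketch of text normalisation: expanding digits and abbreviations into words before G2P conversion, as a conventional TTS pipeline would do explicitly and an end-to-end system must learn implicitly. The abbreviation table and digit-spelling rule are illustrative assumptions, not taken from the paper or from any of the evaluated systems.

```python
# Minimal illustrative text normalisation for a TTS front-end:
# expand digits and a few common abbreviations into spoken words.
import re

ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
ABBREV = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def expand_number(token: str) -> str:
    # Spell out each digit; a real normaliser would produce
    # cardinal words such as "forty-two" instead.
    return " ".join(ONES[int(d)] for d in token)

def normalise(text: str) -> str:
    # Expand known abbreviations first, then any run of digits.
    for abbr, full in ABBREV.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", lambda m: expand_number(m.group()), text)

print(normalise("Dr. Smith lives at 42 Elm St."))
# -> Doctor Smith lives at four two Elm Street
```

Even this toy version hints at why the task is hard: "St." should read "Street" in an address but "Saint" in a name, which is exactly the kind of context-dependent decision (akin to homograph disambiguation) the paper probes in end-to-end systems.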



Acknowledgement

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Author information


Correspondence to Daniel Tihelka.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tihelka, D., Matoušek, J., Tihelková, A. (2021). How Much End-to-End is Tacotron 2 End-to-End TTS System. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_44

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science; Computer Science (R0)
