
How Much End-to-End is Tacotron 2 End-to-End TTS System

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)


Abstract

In recent years, the concept of end-to-end text-to-speech synthesis has begun to attract the attention of researchers. The motivation is simple: replacing the individual modules that TTS systems are traditionally built on with a single powerful deep neural network simplifies the architecture of the entire system. However, how capable are such end-to-end systems of dealing with classic tasks such as G2P conversion, text normalisation, homograph disambiguation and other issues inseparably linked to text-to-speech systems?

In the present paper, we explore three freely available implementations of Tacotron 2-based speech synthesizers, focusing on their ability to transform input text into correct pronunciation, not only in terms of G2P conversion but also in handling issues related to text analysis and the prosody patterns used.
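To illustrate the kind of front-end task the abstract refers to, the following is a minimal, hypothetical sketch of text normalisation: expanding digits and abbreviations into words before G2P conversion, as a conventional TTS pipeline would do explicitly and an end-to-end system must learn implicitly. The abbreviation table and digit-spelling rule are illustrative assumptions, not taken from the paper or from any of the evaluated systems.

```python
# Minimal illustrative text normalisation for a TTS front-end:
# expand digits and a few common abbreviations into spoken words.
import re

ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
ABBREV = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def expand_number(token: str) -> str:
    # Spell out each digit; a real normaliser would produce
    # cardinal words such as "forty-two" instead.
    return " ".join(ONES[int(d)] for d in token)

def normalise(text: str) -> str:
    # Expand known abbreviations first, then any run of digits.
    for abbr, full in ABBREV.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", lambda m: expand_number(m.group()), text)

print(normalise("Dr. Smith lives at 42 Elm St."))
# -> Doctor Smith lives at four two Elm Street
```

Even this toy version hints at why the task is hard: "St." should read "Street" in an address but "Saint" in a name, which is exactly the kind of context-dependent decision (akin to homograph disambiguation) the paper probes in end-to-end systems.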



Acknowledgement

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Author information


Correspondence to Daniel Tihelka.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tihelka, D., Matoušek, J., Tihelková, A. (2021). How Much End-to-End is Tacotron 2 End-to-End TTS System. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_44

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science; Computer Science (R0)
