VITS, Tacotron or FastSpeech? Challenging Some of the Most Popular Synthesizers

Matoušek, Jindřich; Tihelka, Daniel; Tihelková, Alice

doi:10.1007/978-3-031-47665-5_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14408)

Included in the following conference series:

Asian Conference on Pattern Recognition

Abstract

The paper presents a comparative study of three neural speech synthesizers, VITS, Tacotron2 and FastSpeech2, which are currently among the most popular text-to-speech (TTS) systems. Because the systems differ in design, they were tested from several points of view: the analysis covers not only the overall quality of the synthesized speech but also each system's ability to process orthographic and phonetic input. The analysis was carried out on two English voices and one Czech voice.

This research was supported by the Czech Science Foundation (GA CR), project No. GA22-27800S, and by the grant of the University of West Bohemia, project No. SGS-2022-017. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Author information

Authors and Affiliations

New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czechia
Jindřich Matoušek, Daniel Tihelka & Alice Tihelková
Department of English Language and Literature, Faculty of Arts, University of West Bohemia, Pilsen, Czechia
Jindřich Matoušek, Daniel Tihelka & Alice Tihelková

Corresponding author

Correspondence to Daniel Tihelka.

Editor information

Editors and Affiliations

Kyushu Institute of Technology, Kitakyushu, Fukuoka, Japan
Huimin Lu
The University of Sydney, Sydney, NSW, Australia
Michael Blumenstein
Yonsei University, Seoul, Korea (Republic of)
Sung-Bae Cho
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Osaka University, Osaka, Ibaraki, Japan
Yasushi Yagi
Kyushu Institute of Technology, Kitakyushu, Japan
Tohru Kamiya

Cite this paper

Matoušek, J., Tihelka, D., Tihelková, A. (2023). VITS, Tacotron or FastSpeech? Challenging Some of the Most Popular Synthesizers. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14408. Springer, Cham. https://doi.org/10.1007/978-3-031-47665-5_26

DOI: https://doi.org/10.1007/978-3-031-47665-5_26
Published: 05 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47664-8
Online ISBN: 978-3-031-47665-5
eBook Packages: Computer Science, Computer Science (R0)
