VITS: Quality Vs. Speed Analysis

  • Conference paper
Text, Speech, and Dialogue (TSD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14102)

Abstract

In this paper, we analyze the performance of a modern end-to-end speech synthesis model called Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS). We build on the original VITS model and examine how different modifications to its architecture affect synthetic speech quality and computational complexity. Experiments were carried out with two Czech voices, one male and one female. To assess the quality of speech synthesized by the modified models, MUSHRA listening tests were performed. The computational complexity was measured in terms of synthesis speed relative to real time. While the original VITS model is still preferred in terms of speech quality, we present a modification of the original structure that responds significantly faster while still providing acceptable output quality. Such a configuration can be used when system response latency is critical.
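
The computational complexity metric mentioned above is the ratio of the duration of the synthesized audio to the wall-clock time needed to produce it. Below is a minimal sketch of such a measurement, assuming a hypothetical synthesize() callable and a 22.05 kHz output sample rate (neither is specified in this preview):

    import time

    def real_time_factor(synthesize, text, sample_rate=22050):
        """Synthesis speed relative to real time: audio duration / wall-clock time.

        synthesize is a placeholder for any callable that maps text to raw
        audio samples, e.g. a single VITS forward pass.
        """
        start = time.perf_counter()
        audio = synthesize(text)                # run the TTS model
        elapsed = time.perf_counter() - start   # seconds spent synthesizing
        duration = len(audio) / sample_rate     # seconds of audio produced
        return duration / elapsed               # > 1.0 means faster than real time

A value above 1.0 means the system produces speech faster than it takes to play it back, which is the regime required when response latency is critical.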

This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TL05000546.

Notes

  1. https://github.com/coqui-ai/TTS.
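
The toolkit linked in the note above exposes a simple Python API. The following is a minimal usage sketch for synthesizing speech with a pretrained VITS model from its public model zoo; the model name and output path are illustrative and do not correspond to the Czech voices used in the paper:

    # Requires the Coqui TTS package: pip install TTS
    from TTS.api import TTS

    # Load a publicly available pretrained VITS model (illustrative choice).
    tts = TTS(model_name="tts_models/en/ljspeech/vits")

    # Synthesize a sentence and write the waveform to disk.
    tts.tts_to_file(text="An example sentence.", file_path="example.wav")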

Acknowledgements

Computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Corresponding author

Correspondence to Jindřich Matoušek.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Matoušek, J., Tihelka, D. (2023). VITS: Quality Vs. Speed Analysis. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_19

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
