On Comparison of Phonetic Representations for Czech Neural Speech Synthesis

Conference paper published in Text, Speech, and Dialogue (TSD 2022).

Abstract

In this paper, we investigate two research questions concerning the phonetic representation of input text in Czech neural speech synthesis: 1) whether the phonetic alphabet can be reduced, and 2) whether pauses can be removed from the phonetic transcription, leaving the speech synthesis model to predict pause positions itself. In our experiments, three modern speech synthesis models (FastSpeech 2 + Multi-band MelGAN, Glow-TTS + UnivNet, and VITS) were employed. We found that the reduced phonetic alphabet outperforms the traditionally used full phonetic alphabet. Removing pauses, on the other hand, does not help: the presence of pauses (predicted by an external pause prediction tool) in the phonetic transcription leads to slightly better synthetic speech quality.
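The two manipulations compared above can be illustrated with a minimal sketch. The paper does not publish its exact phone mappings or pause symbols, so the allophone pairs and pause markers below are illustrative assumptions only (the Czech allophones [ɱ] and [ŋ] are standardly treated as context-dependent variants of /m/ and /n/):

```python
# Hypothetical sketch: collapse allophones into a reduced phone inventory
# and optionally strip pause symbols from a phonetic transcription.
ALLOPHONE_TO_REDUCED = {
    "ɱ": "m",  # labiodental nasal, allophone of /m/ before f/v
    "ŋ": "n",  # velar nasal, allophone of /n/ before k/g
}

PAUSE_SYMBOLS = {"|", "||"}  # illustrative pause markers, not the paper's


def reduce_transcription(phones, keep_pauses=True):
    """Map allophones to a reduced inventory; optionally drop pauses."""
    out = []
    for p in phones:
        if p in PAUSE_SYMBOLS:
            if keep_pauses:
                out.append(p)
            continue
        out.append(ALLOPHONE_TO_REDUCED.get(p, p))
    return out


print(reduce_transcription(["ŋ", "k", "|", "a"]))                    # pauses kept
print(reduce_transcription(["ŋ", "k", "|", "a"], keep_pauses=False)) # pauses removed
```

In the paper's setup, the variant without pause symbols corresponds to letting the synthesis model place pauses itself, while the variant with pauses relies on an external pause prediction tool.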

This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TL05000546.


Notes

  1. For phonetic transcription we use the International Phonetic Alphabet (IPA), https://www.internationalphoneticassociation.org/content/ipa-chart.

  2. Czech SAMPA, http://www.phon.ucl.ac.uk/home/sampa/czech-uni.htm.


Acknowledgements

Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Correspondence to Jindřich Matoušek.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Matoušek, J., Tihelka, D. (2022). On Comparison of Phonetic Representations for Czech Neural Speech Synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_34

  • DOI: https://doi.org/10.1007/978-3-031-16270-1_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16269-5

  • Online ISBN: 978-3-031-16270-1

  • eBook Packages: Computer Science, Computer Science (R0)
