Abstract
In this paper, we investigate two research questions concerning the phonetic representation of input text in Czech neural speech synthesis: 1) whether the phonetic alphabet can be reduced without degrading quality, and 2) whether pauses can be removed from the phonetic transcription, leaving the speech synthesis model to predict pause positions itself. In our experiments, three modern speech synthesis models (FastSpeech 2 + Multi-band MelGAN, Glow-TTS + UnivNet, and VITS) were employed. We found that the reduced phonetic alphabet outperforms the traditionally used full phonetic alphabet. Removing pauses, on the other hand, does not help: keeping pauses (predicted by an external pause prediction tool) in the phonetic transcription leads to slightly better synthetic speech quality.
This research was supported by the Technology Agency of the Czech Republic (TA CR), project No. TL05000546.
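The two input manipulations studied in the paper can be illustrated with a minimal sketch. Note that this is not the authors' code: the mapping table, the pause symbols, and the example phone strings below are hypothetical placeholders, not the actual Czech SAMPA reduction used in the experiments.

```python
# Illustrative sketch of the two transcription manipulations:
# (a) reducing a phonetic alphabet by mapping selected allophones to
#     base phonemes, and (b) optionally stripping pause symbols so the
#     TTS model must predict pause positions itself.
# All symbols below are hypothetical examples, not the paper's inventory.

ALLOPHONE_TO_PHONEME = {
    "N": "n",   # hypothetical merge: velar nasal -> alveolar nasal
    "G": "k",   # hypothetical merge: voiced velar allophone -> /k/
    "?": "",    # hypothetical removal of the glottal stop symbol
}

PAUSE_SYMBOLS = {"#", "|"}  # hypothetical pause markers


def reduce_transcription(phones, keep_pauses=True):
    """Map each phone through the reduction table; drop pauses on request."""
    out = []
    for p in phones:
        if p in PAUSE_SYMBOLS:
            if keep_pauses:
                out.append(p)
            continue
        p = ALLOPHONE_TO_PHONEME.get(p, p)
        if p:  # symbols mapped to the empty string are deleted
            out.append(p)
    return out


print(reduce_transcription(["?", "a", "N", "k", "#", "t", "a"]))
# -> ['a', 'n', 'k', '#', 't', 'a']
```

With `keep_pauses=False`, the `"#"` marker is dropped as well, corresponding to the paper's second research question.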
Notes
1. For phonetic transcription we use the International Phonetic Alphabet (IPA), https://www.internationalphoneticassociation.org/content/ipa-chart.
2. Czech SAMPA, http://www.phon.ucl.ac.uk/home/sampa/czech-uni.htm.
Acknowledgements
Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Matoušek, J., Tihelka, D. (2022). On Comparison of Phonetic Representations for Czech Neural Speech Synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science(), vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_34
Print ISBN: 978-3-031-16269-5
Online ISBN: 978-3-031-16270-1