Abstract
Although unit selection speech synthesis is no longer at the centre of research, it shows its strengths in deployments where fast fixes and tuning are required. The core of the method is its target and concatenation costs, which usually consist of manually designed features. A flaw in the design of a feature may cause the selection to behave unexpectedly, without necessarily degrading the quality of the speech output. One such feature in our systems was the required match between the expected and the real voicing of units. Thanks to the flexibility of the method, we were able to narrow the behaviour of the selection algorithm without worsening the quality of the synthesised speech.
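To make the role of a voicing-match feature more concrete, the following is a minimal sketch of unit selection as a Viterbi search over candidate units, with a target cost that penalises a mismatch between expected and real voicing and a toy concatenation cost based on the F0 discontinuity at the join. It is not the authors' implementation: the names `Unit`, `target_cost`, `concat_cost` and `select_units`, the single-feature target cost, and the scaling constant are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A candidate unit from the speech database (hypothetical fields)."""
    name: str
    voiced: bool      # voicing actually observed in the recording
    f0_start: float   # F0 at the left boundary, in Hz
    f0_end: float     # F0 at the right boundary, in Hz


def target_cost(expected_voiced, unit, voicing_weight=1.0):
    # Assumed feature: penalise a mismatch between expected and real voicing.
    return voicing_weight if unit.voiced != expected_voiced else 0.0


def concat_cost(left, right):
    # Toy join cost: F0 discontinuity at the boundary (arbitrary scaling).
    return abs(left.f0_end - right.f0_start) / 100.0


def select_units(targets, candidates, voicing_weight=1.0):
    """Viterbi search over candidates.

    targets[i] is the expected voicing of position i,
    candidates[i] is the list of Units available for position i.
    Returns (total cost, list of selected Units).
    """
    # best[j] holds (cumulative cost, path) for paths ending in candidates[i][j].
    best = [(target_cost(targets[0], u, voicing_weight), [u])
            for u in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for u in candidates[i]:
            # Cheapest predecessor path, including the join cost to u.
            cost, path = min(
                ((c + concat_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append(
                (cost + target_cost(targets[i], u, voicing_weight), path + [u])
            )
        best = new_best
    return min(best, key=lambda cp: cp[0])
```

Setting `voicing_weight` to zero switches the voicing requirement off, so the same search can be run with and without the feature to see how strongly it steers the selected sequence.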
Acknowledgements
This research was supported by the Technology Agency of the Czech Republic (project No. TH02010307) and by a grant of the University of West Bohemia (project No. SGS-2019-027).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tihelka, D., Hanzlíček, Z., Jůzová, M. (2020). Uncertainty of Phone Voicing and Its Impact on Speech Synthesis. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_56
DOI: https://doi.org/10.1007/978-3-030-60276-5_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science (R0)