Abstract
Introducing neural networks into the field has significantly improved the performance of speech synthesis systems. Most research has been done on English, which has substantial speech data resources but a low degree of grapheme-phoneme correspondence. Other, low-resource languages pose different challenges that may be overcome using different approaches to text embeddings. This paper presents the results of using stressed text labels in speech datasets to train a speech synthesis model for Lithuanian, a language with limited speech data resources. Nvidia’s implementation of the Tacotron 2 system and a Lithuanian speech dataset (corpus) with stressed text labels were used to train the speech synthesis models. By introducing accentuation into the sample labels, we show a significant improvement in speech naturalness as measured by mean opinion score (MOS).
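For readers unfamiliar with MOS, the following is a minimal sketch of how such a score is commonly computed from listener ratings on a 1 to 5 scale. The function name, the normal-approximation confidence interval, and the sample ratings are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation procedure or results.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for listener ratings on a 1-5 scale.

    half_width is the half-width of a 95% confidence interval under a
    normal approximation (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, invented purely for illustration.
example_ratings = [4, 5, 3, 4, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing two systems, such as models trained with and without stressed labels, would then amount to computing this score for each set of ratings and checking whether the intervals overlap.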
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Radzevičius, A., Raudys, A., Kasparaitis, P. (2021). Speech Synthesis Using Stressed Sample Labels for Languages with Higher Degree of Phonemic Orthography. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2021. Communications in Computer and Information Science, vol 1486. Springer, Cham. https://doi.org/10.1007/978-3-030-88304-1_30
DOI: https://doi.org/10.1007/978-3-030-88304-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88303-4
Online ISBN: 978-3-030-88304-1
eBook Packages: Computer Science, Computer Science (R0)