Speech Synthesis Using Stressed Sample Labels for Languages with Higher Degree of Phonemic Orthography

Radzevičius, Arnas; Raudys, Aistis; Kasparaitis, Pijus

doi:10.1007/978-3-030-88304-1_30

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1486)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

The introduction of neural networks has significantly improved the performance of speech synthesis systems. Most research has been done on English, which has substantial speech data resources but a low degree of grapheme-phoneme correspondence. Other, low-resource languages pose different challenges that may be overcome using different approaches to text embeddings. In this paper we present the results of using stressed text labels in speech datasets to train a speech synthesis model for a language with few speech data resources: Lithuanian. Nvidia's implementation of the Tacotron 2 system and a Lithuanian speech dataset (corpus) with stressed text labels were used to train the speech synthesis models. By introducing accentuation into the sample labels, we show a significant improvement in speech naturalness as measured by mean opinion score (MOS).
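
As an illustration of the evaluation metric named in the abstract, a mean opinion score is the arithmetic mean of listener ratings, commonly collected on a 1 to 5 scale. The following is a minimal sketch, not the authors' evaluation code: the function name `mean_opinion_score`, the normal-approximation confidence interval, and all ratings shown are assumptions made up for this example and are not results from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a list of listener scores on a 1..5 scale (assumed here).
    The interval uses a normal approximation: z * s / sqrt(n).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings are expected on a 1..5 scale")
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for two systems (invented for illustration only).
unstressed_ratings = [3, 4, 3, 3, 4, 2, 3, 4]
stressed_ratings = [4, 5, 4, 4, 5, 4, 3, 5]

for name, ratings in [("unstressed labels", unstressed_ratings),
                      ("stressed labels", stressed_ratings)]:
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting an interval alongside the mean helps judge whether a difference between two systems is larger than the rating noise; a proper comparison would normally use many more listeners and samples than this toy data.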

Author information

Authors and Affiliations

AAI Labs, Suopių g. 21, Riešė, 14265, Lithuania
Arnas Radzevičius
Institute of Informatics, Vilnius University, Naugarduko 24, 03225, Vilnius, Lithuania
Aistis Raudys & Pijus Kasparaitis

Corresponding authors

Correspondence to Arnas Radzevičius, Aistis Raudys or Pijus Kasparaitis.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Audrius Lopata
Kaunas University of Technology, Kaunas, Lithuania
Daina Gudonienė
Kaunas University of Technology, Kaunas, Lithuania
Rita Butkienė

About this paper

Cite this paper

Radzevičius, A., Raudys, A., Kasparaitis, P. (2021). Speech Synthesis Using Stressed Sample Labels for Languages with Higher Degree of Phonemic Orthography. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2021. Communications in Computer and Information Science, vol 1486. Springer, Cham. https://doi.org/10.1007/978-3-030-88304-1_30

DOI: https://doi.org/10.1007/978-3-030-88304-1_30
Published: 07 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88303-4
Online ISBN: 978-3-030-88304-1
eBook Packages: Computer Science, Computer Science (R0)
