Speech Synthesis Using Stressed Sample Labels for Languages with Higher Degree of Phonemic Orthography

  • Conference paper
Information and Software Technologies (ICIST 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1486)

Abstract

The introduction of neural networks has significantly improved the performance of speech synthesis systems. Most research has focused on English, which has substantial speech data resources but a low degree of grapheme-phoneme correspondence. Other, low-resource languages pose different challenges that may be overcome with different approaches to text embeddings. In this paper we report the results of using stressed text labels in speech datasets to train a speech synthesis model for Lithuanian, a language with limited speech data resources. Nvidia’s implementation of the Tacotron 2 system and a Lithuanian speech dataset (corpus) with stressed text labels were used to train the speech synthesis models. By introducing accentuation into the sample labels, we show a significant improvement in speech naturalness as measured by mean opinion score (MOS).
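As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how a MOS and an approximate confidence interval might be computed from listener ratings. The function name `mean_opinion_score`, the 1 to 5 rating scale, the normal-approximation interval, and the sample ratings are all assumptions made for this example; the abstract does not describe the authors' exact evaluation procedure.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes each rating is a listener score on a 1-5 scale and uses a
    normal approximation for the interval (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * std_err


if __name__ == "__main__":
    # Placeholder ratings for illustration only, not data from the paper.
    placeholder_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
    mos, half_width = mean_opinion_score(placeholder_ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing two systems, for example one trained with stressed labels and one without, would then amount to computing this score separately for each set of ratings and checking whether the intervals overlap.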

Author information

Corresponding authors

Correspondence to Arnas Radzevičius, Aistis Raudys or Pijus Kasparaitis.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Radzevičius, A., Raudys, A., Kasparaitis, P. (2021). Speech Synthesis Using Stressed Sample Labels for Languages with Higher Degree of Phonemic Orthography. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2021. Communications in Computer and Information Science, vol 1486. Springer, Cham. https://doi.org/10.1007/978-3-030-88304-1_30

  • DOI: https://doi.org/10.1007/978-3-030-88304-1_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88303-4

  • Online ISBN: 978-3-030-88304-1

  • eBook Packages: Computer Science, Computer Science (R0)
