Deep Learning Based Speech Synthesis with Emotion Overlay

  • Conference paper
  • In: Advances in Computing and Data Sciences (ICACDS 2023)

Abstract

This paper proposes a text-to-speech (TTS) system that synthesizes emotional speech. We identify a lack of stylistic diversity among audio samples produced for the same emotion and therefore propose techniques for combining emotion embeddings with style embeddings in a novel, weight-controlled manner, so that the synthesized speech varies in style with the target speaker while its emotion follows the specified emotion category. We also present several model variants that differ in where the emotion embeddings are injected into the TTS pipeline and how they are combined with the style embeddings. The variant that overlays emotion embeddings on the encoder outputs during inference and combines them with style embeddings in a 3:7 weight ratio, following our proposed approach, achieves a Mean Opinion Score of 3.612, demonstrating the more-than-satisfactory performance of our models in synthesizing style-varying emotional speech.
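
The weighted overlay described in the abstract can be illustrated with a short PyTorch sketch. The function name, tensor shapes, and the additive broadcast onto the encoder states are illustrative assumptions, not the authors' exact implementation; only the 3:7 emotion-to-style weighting and the idea of applying the combined embedding to the encoder outputs come from the abstract.

    import torch

    def overlay_emotion_on_encoder_outputs(encoder_outputs: torch.Tensor,
                                            style_embedding: torch.Tensor,
                                            emotion_embedding: torch.Tensor,
                                            emotion_weight: float = 0.3,
                                            style_weight: float = 0.7) -> torch.Tensor:
        """Combine style and emotion embeddings in a fixed weight ratio and
        overlay the result on the encoder outputs (illustrative sketch).

        encoder_outputs:   (batch, time, channels) text-encoder states
        style_embedding:   (batch, channels) reference/style embedding, e.g. from a GST module
        emotion_embedding: (batch, channels) embedding of the target emotion category
        """
        # Weighted combination, 3:7 emotion-to-style ratio by default.
        combined = emotion_weight * emotion_embedding + style_weight * style_embedding
        # Overlay the combined embedding on every encoder time step via broadcasting.
        return encoder_outputs + combined.unsqueeze(1)

    # Example with dummy tensors (batch=2, time=50, channels=256).
    enc = torch.randn(2, 50, 256)
    style = torch.randn(2, 256)
    emotion = torch.randn(2, 256)
    conditioned = overlay_emotion_on_encoder_outputs(enc, style, emotion)
    print(conditioned.shape)  # torch.Size([2, 50, 256])
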

A. Bhat, S. Priya, A. Sethi, K. U. Shet, R. Srinath—These authors contributed equally to this work.



Author information

Correspondence to Abhijnya Bhat.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bhat, A., Priya, S., Sethi, A., Shet, K.U., Srinath, R. (2023). Deep Learning Based Speech Synthesis with Emotion Overlay. In: Singh, M., Tyagi, V., Gupta, P., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2023. Communications in Computer and Information Science, vol 1848. Springer, Cham. https://doi.org/10.1007/978-3-031-37940-6_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-37940-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37939-0

  • Online ISBN: 978-3-031-37940-6

  • eBook Packages: Computer Science, Computer Science (R0)
