Abstract
This paper proposes a text-to-speech (TTS) system that synthesizes emotional speech. We identify a lack of style diversity in audio produced for the same emotion, and therefore propose techniques for combining emotion embeddings with style embeddings in a novel weight-controlled manner, so that the synthesized speech varies in style with the target speaker and in emotion with the specified emotion category. We also present several model variants that explore different methods of injecting emotion embeddings into the TTS pipeline and different strategies for combining them with style embeddings. In listening tests, the variant that overlays emotion embeddings on the encoder outputs during inference and combines them with style embeddings in a 3:7 weight ratio, following our novel approach, achieves a Mean Opinion Score of 3.612, demonstrating the more-than-satisfactory performance of our models in synthesizing style-varying emotional speech.
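The weighted combination described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, array shapes, and the broadcast-over-time overlay are assumptions made for the example.

```python
import numpy as np

def overlay_emotion_style(encoder_outputs, emotion_emb, style_emb,
                          emotion_weight=0.3, style_weight=0.7):
    """Combine emotion and style embeddings in a fixed weight ratio
    (3:7 by default, as in the best-performing variant) and overlay
    the result on the TTS encoder outputs, broadcasting the combined
    embedding across all encoder time steps."""
    combined = emotion_weight * emotion_emb + style_weight * style_emb
    # (T, D) encoder outputs + (D,) embedding -> (T, D) via broadcasting
    return encoder_outputs + combined

# Toy example: 5 encoder time steps, embedding dimension 8
enc = np.zeros((5, 8))
emo = np.ones(8)          # stand-in emotion embedding
sty = np.full(8, 2.0)     # stand-in style embedding
out = overlay_emotion_style(enc, emo, sty)
# Each entry is 0.3 * 1.0 + 0.7 * 2.0 = 1.7
```

In a real system the two embeddings would come from trained emotion and style encoders (e.g. global style tokens), and the overlay would happen inside the synthesis network rather than on raw arrays.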
A. Bhat, S. Priya, A. Sethi, K. U. Shet, R. Srinath—These authors contributed equally to this work.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bhat, A., Priya, S., Sethi, A., Shet, K.U., Srinath, R. (2023). Deep Learning Based Speech Synthesis with Emotion Overlay. In: Singh, M., Tyagi, V., Gupta, P., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2023. Communications in Computer and Information Science, vol 1848. Springer, Cham. https://doi.org/10.1007/978-3-031-37940-6_25
Print ISBN: 978-3-031-37939-0
Online ISBN: 978-3-031-37940-6