Deep Learning Based Speech Synthesis with Emotion Overlay

  • Conference paper
  • In: Advances in Computing and Data Sciences (ICACDS 2023)

Abstract

This paper proposes a text-to-speech (TTS) system that synthesizes emotional speech. We identify a lack of stylistic diversity among audio samples produced for the same emotion and therefore propose techniques for combining emotion embeddings with style embeddings in a novel, weight-controlled manner, so that the synthesized speech varies in style with the target speaker while its emotion follows the specified emotion category. We also present several model variants that differ in where the emotion embeddings are injected into the TTS pipeline and how they are combined with the style embeddings. The variant that overlays emotion embeddings on the encoder outputs during inference and combines them with style embeddings in a 3:7 weight ratio, following our proposed approach, achieves a Mean Opinion Score of 3.612, demonstrating the more-than-satisfactory performance of our models in synthesizing style-varying emotional speech.
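
The weighted overlay described in the abstract can be illustrated with a short PyTorch sketch. The function name, tensor shapes, and the additive broadcast onto the encoder states are illustrative assumptions, not the authors' exact implementation; only the 3:7 emotion-to-style weighting and the idea of applying the combined embedding to the encoder outputs come from the abstract.

    import torch

    def overlay_emotion_on_encoder_outputs(encoder_outputs: torch.Tensor,
                                            style_embedding: torch.Tensor,
                                            emotion_embedding: torch.Tensor,
                                            emotion_weight: float = 0.3,
                                            style_weight: float = 0.7) -> torch.Tensor:
        """Combine style and emotion embeddings in a fixed weight ratio and
        overlay the result on the encoder outputs (illustrative sketch).

        encoder_outputs:   (batch, time, channels) text-encoder states
        style_embedding:   (batch, channels) reference/style embedding, e.g. from a GST module
        emotion_embedding: (batch, channels) embedding of the target emotion category
        """
        # Weighted combination, 3:7 emotion-to-style ratio by default.
        combined = emotion_weight * emotion_embedding + style_weight * style_embedding
        # Overlay the combined embedding on every encoder time step via broadcasting.
        return encoder_outputs + combined.unsqueeze(1)

    # Example with dummy tensors (batch=2, time=50, channels=256).
    enc = torch.randn(2, 50, 256)
    style = torch.randn(2, 256)
    emotion = torch.randn(2, 256)
    conditioned = overlay_emotion_on_encoder_outputs(enc, style, emotion)
    print(conditioned.shape)  # torch.Size([2, 50, 256])
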

A. Bhat, S. Priya, A. Sethi, K. U. Shet, R. Srinath—These authors contributed equally to this work.



Author information

Correspondence to Abhijnya Bhat.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bhat, A., Priya, S., Sethi, A., Shet, K.U., Srinath, R. (2023). Deep Learning Based Speech Synthesis with Emotion Overlay. In: Singh, M., Tyagi, V., Gupta, P., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2023. Communications in Computer and Information Science, vol 1848. Springer, Cham. https://doi.org/10.1007/978-3-031-37940-6_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-37940-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37939-0

  • Online ISBN: 978-3-031-37940-6

  • eBook Packages: Computer Science, Computer Science (R0)
