
Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech

  • Conference paper
Statistical Language and Speech Processing (SLSP 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12379)


Abstract

In this paper, we propose to use the multi-class N-pair loss, from deep metric learning, for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) to transfer expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. We use the mean of the latent variables for each emotion to transfer expressivity and generate expressive speech in the desired speaker's voice. In contrast to commonly used loss functions such as the triplet loss or the contrastive loss, the multi-class N-pair loss considers all negative examples at once, which makes each emotion class better distinguished from the others. Furthermore, the presented approach helps create a representation of expressivity that is robust to speaker identity. Our proposed approach demonstrates improved performance for transferring expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time the multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
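As a rough illustration of the loss described above (this page contains no code of its own), the following is a minimal PyTorch sketch of the multi-class N-pair loss introduced by Sohn (2016), together with the latent-mean expressivity transfer the abstract describes. The function name, tensor shapes, and the toy emotion setup are our own assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F


def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss (Sohn, 2016) -- a hypothetical sketch.

    anchors:   (N, D) tensor, one embedding per class (e.g. one per emotion).
    positives: (N, D) tensor, a second embedding of the same classes,
               row-aligned with `anchors`.

    Each anchor i is pulled toward positive i and simultaneously pushed
    away from the positives of all N-1 other classes:
        L = (1/N) * sum_i log(1 + sum_{j != i} exp(f_i.f_j+ - f_i.f_i+))
    which is exactly cross-entropy over the similarity logits, with row i
    having target class i.
    """
    logits = anchors @ positives.t()   # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    torch.manual_seed(0)

    # Toy setup: 4 emotion classes, 16-dimensional latent embeddings.
    anchors = torch.randn(4, 16, requires_grad=True)
    positives = torch.randn(4, 16)
    loss = n_pair_loss(anchors, positives)
    loss.backward()
    print("N-pair loss:", loss.item())

    # Expressivity transfer as described in the abstract: average the
    # latent vectors of all utterances sharing one emotion and use that
    # mean as the expressivity code at synthesis time (stand-in data).
    joy_latents = torch.randn(10, 16)  # stand-ins for encoder outputs
    z_joy = joy_latents.mean(dim=0)    # (16,) mean latent for one emotion
    print("mean latent shape:", tuple(z_joy.shape))
```

Because all N − 1 negatives enter the softmax jointly, a single update pushes each emotion class away from every other class, whereas a triplet loss contrasts only one negative at a time.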



Acknowledgements

Experiments presented in this paper were carried out using the Grid5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).

Author information


Corresponding author

Correspondence to Ajinkya Kulkarni.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Kulkarni, A., Colotte, V., Jouvet, D. (2020). Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2020. Lecture Notes in Computer Science, vol. 12379. Springer, Cham. https://doi.org/10.1007/978-3-030-59430-5_13


  • DOI: https://doi.org/10.1007/978-3-030-59430-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59429-9

  • Online ISBN: 978-3-030-59430-5

  • eBook Packages: Computer Science, Computer Science (R0)
