
Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech

  • Conference paper
Statistical Language and Speech Processing (SLSP 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12379)


Abstract

In this paper, we propose to use the multi-class N-pair loss, from deep metric learning, for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) to transfer expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. We use the mean of the latent variables for each emotion to transfer expressivity and generate expressive speech in the desired speaker's voice. In contrast to commonly used loss functions such as the triplet loss or the contrastive loss, the multi-class N-pair loss considers all negative examples at once, which makes each emotion class better distinguished from the others. Furthermore, the presented approach helps create a representation of expressivity that is robust to speaker identity. Our proposed approach demonstrates improved performance for transferring expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time the multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
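As a rough illustration of the loss described above (this page contains no code of its own), the following is a minimal PyTorch sketch of the multi-class N-pair loss introduced by Sohn (2016), together with the latent-mean expressivity transfer the abstract describes. The function name, tensor shapes, and the toy emotion setup are our own assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F


def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss (Sohn, 2016) -- a hypothetical sketch.

    anchors:   (N, D) tensor, one embedding per class (e.g. one per emotion).
    positives: (N, D) tensor, a second embedding of the same classes,
               row-aligned with `anchors`.

    Each anchor i is pulled toward positive i and simultaneously pushed
    away from the positives of all N-1 other classes:
        L = (1/N) * sum_i log(1 + sum_{j != i} exp(f_i.f_j+ - f_i.f_i+))
    which is exactly cross-entropy over the similarity logits, with row i
    having target class i.
    """
    logits = anchors @ positives.t()   # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    torch.manual_seed(0)

    # Toy setup: 4 emotion classes, 16-dimensional latent embeddings.
    anchors = torch.randn(4, 16, requires_grad=True)
    positives = torch.randn(4, 16)
    loss = n_pair_loss(anchors, positives)
    loss.backward()
    print("N-pair loss:", loss.item())

    # Expressivity transfer as described in the abstract: average the
    # latent vectors of all utterances sharing one emotion and use that
    # mean as the expressivity code at synthesis time (stand-in data).
    joy_latents = torch.randn(10, 16)  # stand-ins for encoder outputs
    z_joy = joy_latents.mean(dim=0)    # (16,) mean latent for one emotion
    print("mean latent shape:", tuple(z_joy.shape))
```

Because all N − 1 negatives enter the softmax jointly, a single update pushes each emotion class away from every other class, whereas a triplet loss contrasts only one negative at a time.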



Acknowledgements

Experiments presented in this paper were carried out using the Grid5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).

Author information


Corresponding author

Correspondence to Ajinkya Kulkarni.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Kulkarni, A., Colotte, V., Jouvet, D. (2020). Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2020. Lecture Notes in Computer Science, vol. 12379. Springer, Cham. https://doi.org/10.1007/978-3-030-59430-5_13


  • DOI: https://doi.org/10.1007/978-3-030-59430-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59429-9

  • Online ISBN: 978-3-030-59430-5

  • eBook Packages: Computer Science, Computer Science (R0)
