Abstract
In this paper, we propose to use the deep metric learning based multi-class N-pair loss for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) for transferring expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. We use the mean of the latent variables of each emotion to transfer expressivity and generate expressive speech in the desired speaker's voice. In contrast to commonly used loss functions such as triplet loss or contrastive loss, the multi-class N-pair loss considers all the negative examples, which makes each emotion class distinguishable from the others. Furthermore, the presented approach helps build a representation of expressivity that is robust to speaker identity. Our proposed approach demonstrates improved performance for the transfer of expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time that the multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
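For illustration only, the sketch below shows the multi-class N-pair loss (Sohn, 2016) applied to per-emotion latent vectors, together with the per-emotion latent mean used at synthesis time. It is written in PyTorch; the function names, the batching convention (one anchor/positive pair per emotion class), and the choice of PyTorch are assumptions made for this example, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multiclass_npair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Multi-class N-pair loss (Sohn, 2016).

    anchors:   (N, D) latent vectors, one anchor per emotion class.
    positives: (N, D) latent vectors, the matching positive of each anchor;
               for anchor i, the positives of the other N-1 classes act as
               its negatives.
    """
    # Similarity of every anchor against every positive: an (N, N) matrix
    # whose diagonal holds the anchor/positive similarities.
    logits = anchors @ positives.t()
    # Softmax cross-entropy against the diagonal equals
    # (1/N) * sum_i log(1 + sum_{j != i} exp(f_i.f_j+ - f_i.f_i+)).
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)

def emotion_mean_latents(latents: torch.Tensor, emotion_ids: torch.Tensor) -> dict:
    """Average the latent vectors per emotion; at synthesis time the mean
    latent of the desired emotion conditions the decoder (hypothetical helper)."""
    return {int(e): latents[emotion_ids == e].mean(dim=0)
            for e in emotion_ids.unique()}
```

The softmax cross-entropy form is mathematically equivalent to the log(1 + sum of exponentials) expression of the N-pair loss; each anchor is contrasted against all N-1 negatives at once, which is what distinguishes it from triplet or contrastive losses that use a single negative per update.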
Acknowledgements
Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kulkarni, A., Colotte, V., Jouvet, D. (2020). Deep Variational Metric Learning for Transfer of Expressivity in Multispeaker Text to Speech. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds) Statistical Language and Speech Processing. SLSP 2020. Lecture Notes in Computer Science, vol. 12379. Springer, Cham. https://doi.org/10.1007/978-3-030-59430-5_13
DOI: https://doi.org/10.1007/978-3-030-59430-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59429-9
Online ISBN: 978-3-030-59430-5
eBook Packages: Computer Science (R0)