Abstract
This paper presents a study on voice interpolation in the framework of neural text-to-speech. Two main approaches are considered. The first adds three independent speaker embeddings at three different positions within the model. The second replaces the embedding vectors with convolutional layers whose kernels are computed on the fly from reference spectrograms. Interpolation between speakers is performed by linear interpolation between the speaker embeddings in the first case, and between the convolution kernels in the second. Finally, we propose a new method for evaluating interpolation smoothness based on the agreement between the interpolation weights and objective and subjective speaker similarities. The results indicate that both methods can produce smooth interpolation to some extent, with the one based on learned speaker embeddings yielding better results.
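The two interpolation schemes described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, the embedding dimension, and the weighting convention are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, w):
    """Linear interpolation between two learned speaker embeddings
    (first approach): w = 1 recovers speaker A, w = 0 speaker B."""
    return w * emb_a + (1.0 - w) * emb_b

def interpolate_kernels(kernels_a, kernels_b, w):
    """Linear interpolation between per-layer convolution kernels
    predicted from reference spectrograms (second approach)."""
    return [w * ka + (1.0 - w) * kb for ka, kb in zip(kernels_a, kernels_b)]

# Illustrative example: 256-dim speaker embeddings, weight 0.3.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)
mix = interpolate_embeddings(emb_a, emb_b, 0.3)
```

A synthesizer conditioned on `mix` (or on the mixed kernels) would then be expected to produce a voice between the two reference speakers, with smoothness judged by how well perceived similarity tracks the weight `w`.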
R. Korostik—Work done during internship at Apple.
Notes
1. Analogous to GST, a VAE in the context of text-to-speech is usually seen as an auxiliary module to the main network; the whole model can be interpreted as a variational autoencoder of spectrograms with a decoder conditioned on the input text.
© 2021 Springer Nature Switzerland AG
Cite this paper
Korostik, R., Latorre, J., Achanta, S., Stylianou, Y. (2021). Assessing Speaker Interpolation in Neural Text-to-Speech. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3