Abstract
This paper presents a study on voice interpolation in the framework of neural text-to-speech. Two main approaches are considered. The first adds three independent speaker embeddings at three different positions within the model. The second replaces the embedding vectors with convolutional layers whose kernels are computed on the fly from reference spectrograms. Interpolation between speakers is performed by linear interpolation between the speaker embeddings in the first case, and between the convolution kernels in the second. Finally, we propose a new method for evaluating interpolation smoothness based on the agreement between the interpolation weights and objective and subjective speaker similarities. The results indicate that both methods can produce smooth interpolation to some extent, with the one based on learned speaker embeddings yielding better results.
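The two interpolation schemes described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, the embedding dimension, and the weighting convention are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, w):
    """Linear interpolation between two learned speaker embeddings
    (first approach): w = 1 recovers speaker A, w = 0 speaker B."""
    return w * emb_a + (1.0 - w) * emb_b

def interpolate_kernels(kernels_a, kernels_b, w):
    """Linear interpolation between per-layer convolution kernels
    predicted from reference spectrograms (second approach)."""
    return [w * ka + (1.0 - w) * kb for ka, kb in zip(kernels_a, kernels_b)]

# Illustrative example: 256-dim speaker embeddings, weight 0.3.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)
mix = interpolate_embeddings(emb_a, emb_b, 0.3)
```

A synthesizer conditioned on `mix` (or on the mixed kernels) would then be expected to produce a voice between the two reference speakers, with smoothness judged by how well perceived similarity tracks the weight `w`.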
R. Korostik—Work done during internship at Apple.
Notes
1. Analogous to GST, a VAE in the context of text-to-speech is usually seen as an auxiliary module to the main network; the whole model can be interpreted as a variational autoencoder of spectrograms with a decoder conditioned on the input text.
© 2021 Springer Nature Switzerland AG
Cite this paper
Korostik, R., Latorre, J., Achanta, S., Stylianou, Y. (2021). Assessing Speaker Interpolation in Neural Text-to-Speech. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3