Assessing Speaker Interpolation in Neural Text-to-Speech

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

This paper presents a study on voice interpolation in the framework of neural text-to-speech. Two main approaches are considered. The first adds three independent speaker embeddings at three different positions within the model. The second replaces the embedding vectors with convolutional layers whose kernels are computed on the fly from reference spectrograms. Interpolation between speakers is performed by linear interpolation between the speaker embeddings in the first case, and between the convolution kernels in the second. Finally, we propose a new method for evaluating interpolation smoothness based on the agreement between interpolation weights and objective and subjective speaker similarities. The results indicate that both methods can produce reasonably smooth interpolation, with the one based on learned speaker embeddings yielding better results.
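
To make the two interpolation schemes concrete, here is a minimal PyTorch-style sketch (all function and tensor names are hypothetical illustrations, not the paper's implementation): both approaches reduce to a convex combination of per-speaker parameters, either embedding vectors or convolution kernels predicted from reference spectrograms.

```python
import torch
import torch.nn.functional as F

def interpolate_embeddings(emb_a, emb_b, alpha):
    """Linearly interpolate two learned speaker embeddings.

    emb_a, emb_b: speaker embedding vectors of shape (dim,)
    alpha: interpolation weight in [0, 1]; 0 -> speaker A, 1 -> speaker B
    """
    return (1.0 - alpha) * emb_a + alpha * emb_b

def interpolate_kernels(kernel_a, kernel_b, alpha, x):
    """Linearly interpolate two speaker-specific convolution kernels,
    then apply the blended kernel to hidden activations.

    kernel_a, kernel_b: kernels of shape (out_ch, in_ch, k), each predicted
                        on the fly from one speaker's reference spectrogram
    x: hidden activations of shape (batch, in_ch, time)
    """
    kernel = (1.0 - alpha) * kernel_a + alpha * kernel_b
    return F.conv1d(x, kernel, padding=kernel.shape[-1] // 2)
```

Sweeping alpha from 0 to 1 should move the synthesized voice smoothly from speaker A to speaker B, which is exactly the property the proposed smoothness evaluation is designed to measure.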

R. Korostik—Work done during internship at Apple.

Notes

  1. Analogous to GST, a VAE in the context of text-to-speech is usually seen as an auxiliary module to the main network; the whole model can be interpreted as a variational autoencoder of spectrograms whose decoder is conditioned on the input text.
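
As a rough illustration of this interpretation, the sketch below (a toy example with hypothetical names, not the authors' code) treats the reference encoder as the approximate posterior of a VAE over spectrograms, with the synthesis decoder conditioned on both the text encoding and the sampled latent:

```python
import torch
import torch.nn as nn

class SpectrogramVAE(nn.Module):
    """Toy view of a TTS model as a VAE over spectrograms."""

    def __init__(self, spec_dim=80, latent_dim=16, text_dim=256):
        super().__init__()
        # Reference encoder: approximate posterior q(z | spectrogram).
        self.ref_encoder = nn.GRU(spec_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Decoder: reconstructs the spectrogram conditioned on text and z.
        self.decoder = nn.Linear(text_dim + latent_dim, spec_dim)

    def forward(self, text_hidden, ref_spec):
        # text_hidden: (batch, time, text_dim); ref_spec: (batch, time, spec_dim)
        _, h = self.ref_encoder(ref_spec)                 # h: (1, batch, 128)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        # Reparameterization trick: sample z from q(z | spectrogram).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        z_expanded = z.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        recon = self.decoder(torch.cat([text_hidden, z_expanded], dim=-1))
        return recon, mu, logvar  # train with reconstruction + KL losses
```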

Author information

Correspondence to Roman Korostik.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Korostik, R., Latorre, J., Achanta, S., Stylianou, Y. (2021). Assessing Speaker Interpolation in Neural Text-to-Speech. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_33

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
