
Exploring Effective Speech Representation via ASR for High-Quality End-to-End Multispeaker TTS

  • Conference paper
Neural Information Processing (ICONIP 2021)

Abstract

The quality of multispeaker text-to-speech (TTS) comprises two aspects: speech naturalness and speaker similarity. Current multispeaker TTS systems based on speaker embeddings extracted by speaker verification (SV) or speaker recognition (SR) models have made significant progress in the speaker similarity of synthesized speech. SV/SR tasks build the speaker space from the differences between speakers in the training set and thus extract speaker embeddings that improve speaker similarity; however, they degrade the naturalness of synthetic speech because such embeddings lose speech dynamics to some extent. Unlike SV/SR models, the outputs of an automatic speech recognition (ASR) encoder contain relatively complete speech information, such as speaker identity, timbre, and prosody. We therefore propose an ASR-based synthesis framework that extracts speech embeddings with an ASR encoder to improve multispeaker TTS quality, especially speech naturalness. To help the ASR system learn speaker characteristics better, we explicitly add the speaker ID to the training label. Experimental results show that the speech embeddings extracted by the proposed method carry good speaker characteristics as well as acoustic information beneficial to speech naturalness. The proposed method significantly improves both the naturalness and the speaker similarity of multispeaker TTS.
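The abstract describes the idea only at a high level; the sketch below is a hypothetical illustration, not the authors' implementation. All module names, dimensions, pooling choices, and the label format are assumptions. It shows (i) pooling ASR-encoder outputs into an utterance-level speech embedding, (ii) conditioning a TTS text encoder's outputs on that embedding, and (iii) prepending a speaker-ID token to the ASR training label, as the abstract mentions.

# Hypothetical sketch (not the authors' code): extract a speech embedding
# from an ASR encoder and condition a TTS decoder input on it.
import torch
import torch.nn as nn

class ASREncoder(nn.Module):
    """Toy stand-in for a Transformer ASR encoder; its hidden states retain
    speaker, timbre, and prosody information, unlike an SV/SR d-vector."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mel):                    # mel: (B, T, n_mels)
        return self.encoder(self.proj(mel))    # (B, T, d_model)

def speech_embedding(encoder, ref_mel):
    """Average the ASR encoder outputs over time to get a fixed-size
    utterance-level speech embedding (mean pooling is one simple choice)."""
    with torch.no_grad():
        h = encoder(ref_mel)                   # (B, T, d_model)
    return h.mean(dim=1)                       # (B, d_model)

def condition_tts_inputs(text_hidden, spk_emb):
    """Broadcast the speech embedding and concatenate it to every
    text-encoder frame before the TTS decoder (Tacotron 2-style)."""
    B, L, _ = text_hidden.shape
    spk = spk_emb.unsqueeze(1).expand(B, L, -1)
    return torch.cat([text_hidden, spk], dim=-1)

def make_asr_label(speaker_id: int, transcript: str) -> str:
    """Add the speaker ID to the ASR training label; the exact token
    format here is an assumption."""
    return f"<spk_{speaker_id:04d}> {transcript}"

if __name__ == "__main__":
    enc = ASREncoder()
    ref_mel = torch.randn(2, 120, 80)          # reference utterances
    text_hidden = torch.randn(2, 50, 512)      # toy TTS text-encoder output
    emb = speech_embedding(enc, ref_mel)
    cond = condition_tts_inputs(text_hidden, emb)
    print(cond.shape)                          # torch.Size([2, 50, 768])
    print(make_asr_label(42, "HELLO WORLD"))   # <spk_0042> HELLO WORLD

Mean pooling over encoder frames is only one possible reduction; the paper may instead use the frame-level encoder outputs directly, so this sketch should be read as an illustration of the conditioning idea rather than the reported architecture.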


Notes

  1. https://daveliuabc.github.io/multispeaker-demo/.

  2. https://github.com/tensorflow/tensor2tensor.

  3. We train the phone-level ASR system so that we can extract the phonetic posteriorgram (PPG) feature for TTS in the future.

  4. https://github.com/CorentinJ/Real-Time-Voice-Cloning.

  5. https://github.com/mkotha/WaveRNN.


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61771333, NICT International Funding, and JSPS KAKENHI Grant No. 21K17837. We thank Prof. Zhenhua Ling of the University of Science and Technology of China for useful discussions.

Author information


Corresponding authors

Correspondence to Longbiao Wang or Sheng Li.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, D. et al. (2021). Exploring Effective Speech Representation via ASR for High-Quality End-to-End Multispeaker TTS. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_13


  • DOI: https://doi.org/10.1007/978-3-030-92310-5_13


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92309-9

  • Online ISBN: 978-3-030-92310-5

  • eBook Packages: Computer Science, Computer Science (R0)
