Abstract
Recent end-to-end TTS models generate human-like natural speech in real time, but they produce pronunciation errors that degrade the naturalness of the synthesized speech. In this paper, we investigate a method to alleviate the mispronunciation problem, one of the challenges in end-to-end TTS. To address this problem, we propose a novel framework that incorporates a pronunciation predictor, which predicts the phoneme sequence corresponding to a given character sequence, into the encoder of the end-to-end TTS model. Our model is based on a non-autoregressive feed-forward Transformer, which generates the mel-spectrogram in parallel, and the pronunciation predictor also has a feed-forward architecture. Motivated by the idea that the pronunciation errors of an end-to-end model are caused by the limited and unbalanced lexical coverage of the training data, we also propose a two-stage training scheme in which the pronunciation predictor is pre-trained on a large-scale language dataset. Experimental results showed that our model outperforms FastSpeech in the naturalness assessment, and the phoneme error rate dropped from 8.7 to 1.4%. From the experimental results, we also found that using the pronunciation information is effective for duration prediction.
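The paper does not include an implementation, but the idea of a feed-forward, non-autoregressive pronunciation predictor can be sketched roughly as follows. This is a minimal NumPy illustration only: the vocabulary sizes, dimensions, layer count, and random weights are all hypothetical stand-ins (in the proposed framework the predictor would be pre-trained on a large-scale language dataset and its output fed into the TTS encoder).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabularies and dimensions (not from the paper).
CHAR_VOCAB, PHONE_VOCAB, EMB_DIM, HID_DIM = 30, 45, 16, 32

# Randomly initialized parameters stand in for the pre-trained predictor.
W_emb = rng.normal(0.0, 0.1, (CHAR_VOCAB, EMB_DIM))
W1 = rng.normal(0.0, 0.1, (EMB_DIM, HID_DIM))
W2 = rng.normal(0.0, 0.1, (HID_DIM, PHONE_VOCAB))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_phonemes(char_ids):
    """Map a character-ID sequence to per-position phoneme probabilities.

    Every position is computed in parallel (no autoregression),
    mirroring the parallel generation of the feed-forward
    Transformer backbone described in the abstract.
    """
    h = np.maximum(W_emb[char_ids] @ W1, 0.0)  # ReLU hidden layer
    return softmax(h @ W2)                      # shape: (seq_len, PHONE_VOCAB)

probs = predict_phonemes(np.array([3, 7, 1, 12]))
phoneme_ids = probs.argmax(axis=-1)  # one predicted phoneme per character
```

In the framework the abstract describes, these predicted pronunciations (or the predictor's hidden representations) would be consumed by the encoder of the TTS model; the sketch above only shows the input/output shape of such a predictor.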
References
Achanta, S., Pandey, A., & Gangashetty, S. V. (2016). Analysis of sequence to sequence neural networks on grapheme to phoneme conversion task. In Proceedings of international joint conference on neural networks (IJCNN) (pp. 2798–2804).
Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceeedings of international conference on machine learning (ICML) (pp. 195–204).
Beliaev, S., Rebryk, Y., & Ginsburg, B. (2020). TalkNet: Fully-convolutional non-autoregressive speech synthesis model. arXiv preprint arXiv:2005.05514
Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems (pp. 2962–2970).
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236–243.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 1996) (pp. 373–376).
Kalchbrenner, N., Elsen, E., Simonyan K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., & Kavukcuoglu, K. (2018). Efficient neural audio synthesis. In Proceedings of international conference on machine learning (ICML) (pp. 2415–2424).
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of international conference on learning representations (ICLR).
Kumar, K., Kumar, R., Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., Brebisson, A., Bengio, Y., & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in neural information processing systems (pp. 14881–14892).
Łancucki, A. (2020). Fastpitch: Parallel text-to-speech with pitch prediction. arXiv preprint arXiv:2006.06873
Li, J., Wu, Z., Li, R., Zhi, P., Yang, S., & Meng, H. (2019a). Knowledge-based linguistic encoding for end-to-end mandarin text-to-speech synthesis. In Proceedings of Interspeech (pp. 4494–4498).
Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019b). Neural speech synthesis with transformer network. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, (pp. 6706–6713).
Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flowtts: A non-autoregressive network for text to speech based on flow. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 7209–7213).
Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877–1884.
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (pp. 125–125).
Peng, K., Ping, W., Song, Z., & Zhao, K. (2020). Non-autoregressive neural text-to-speech. In Proceedings of international conference on machine learning (ICML) (pp. 10192–10204).
Ping, W., Peng, K., & Chen, J. (2019). ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proceedings of international conference on learning representations (ICLR).
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2018). Deep voice 3: 2000-speaker neural text-to-speech. In Proceedings of international conference on learning representations (ICLR).
Prenger, R., Valle, R., & Catanzaro, B. (2019). Waveglow: A flow-based generative network for speech synthesis. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2019) (pp. 3617–3621).
Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2020). Fastspeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2019). Fastspeech: Fast, robust and controllable text to speech. In Advances in neural information processing systems (pp. 3165–3174).
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2018) (pp. 4779–4783).
Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., & Bengio, Y. (2017). Char2wav: End-to-end speech synthesis. In Proceedings of international conference on learning representations (ICLR) (pp. 24–26).
Taylor, J., & Richmond K. (2019). Analysis of pronunciation learning in end-to-end speech synthesis. In Proceedings of Interspeech (pp. 2070–2074).
Valle, R., Shih, K. J., Prenger, R., & Catanzaro, B. (2021). Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. In Proceedings of international conference on learning representations (ICLR).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech (pp. 4006–4010).
Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech. arXiv preprint, arXiv:2005.05106.
Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech (pp. 3330–3334).
Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2013) (pp. 7962–7966).
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Zeng, Z., Wang, J., Cheng, N., Xia, T., & Xiao, J. (2020). Aligntts: Efficient feed-forward text-to-speech system without explicit alignment. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 6714–6718).
Acknowledgements
We appreciate the helpful discussions with Dr. Kim and Dr. Ri, and we thank the anonymous reviewers and editors for their many invaluable comments and suggestions that improved this paper.
Ethics declarations
Compliance with ethical standards
The authors have no items to declare under the guidelines.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Han, CJ., Ri, UC., Mun, SI. et al. An end-to-end TTS model with pronunciation predictor. Int J Speech Technol 25, 1013–1024 (2022). https://doi.org/10.1007/s10772-022-10008-7