
An end-to-end TTS model with pronunciation predictor

International Journal of Speech Technology

Abstract

Recent end-to-end TTS models generate human-like, natural speech in real time, but they produce pronunciation errors that degrade the naturalness of the synthesized speech. In this paper, we investigate a method to alleviate the mispronunciation problem, one of the remaining challenges in end-to-end TTS. To address this problem, we propose a novel framework that incorporates a pronunciation predictor, which predicts the phoneme sequence corresponding to a given character sequence, into the encoder of the end-to-end TTS model. Our model is based on a non-autoregressive feed-forward Transformer that generates the mel-spectrogram in parallel, and the pronunciation predictor also has a feed-forward architecture. Motivated by the idea that the pronunciation errors of the end-to-end model are caused by the limited and unbalanced lexical coverage of the training data, we also propose a two-stage training scheme in which the pronunciation predictor is pre-trained on a large-scale language dataset. Experimental results show that our model outperforms FastSpeech in a naturalness assessment, while the phoneme error rate drops from 8.7% to 1.4%. The experiments also show that using the pronunciation information is effective for duration prediction.
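To make the architecture described in the abstract concrete, the following is a minimal sketch, assuming PyTorch, of how a feed-forward pronunciation predictor could be attached to the character encoder of a FastSpeech-style model. Module names, vocabulary sizes, and dimensions (e.g. PronunciationPredictor, char_vocab, dim) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a feed-forward pronunciation
# predictor maps each input character to phoneme logits; the predicted phoneme
# embeddings are added to the character embeddings before the TTS encoder.
import torch
import torch.nn as nn

class PronunciationPredictor(nn.Module):
    """Predicts a phoneme distribution per input character (assumes 1:1 alignment)."""
    def __init__(self, char_vocab=80, phone_vocab=70, dim=256, hidden=1024, n_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_layers)])
        self.proj = nn.Linear(dim, phone_vocab)

    def forward(self, chars):                      # chars: (batch, T) int64
        x = self.char_emb(chars)                   # (batch, T, dim)
        for ff, norm in zip(self.blocks, self.norms):
            x = norm(x + ff(x))                    # residual feed-forward blocks
        return self.proj(x)                        # (batch, T, phone_vocab) logits

class EncoderWithPronunciation(nn.Module):
    """Fuses predicted phoneme embeddings with character embeddings in the encoder."""
    def __init__(self, char_vocab=80, phone_vocab=70, dim=256):
        super().__init__()
        self.pron = PronunciationPredictor(char_vocab, phone_vocab, dim)
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.phone_emb = nn.Embedding(phone_vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, chars):
        phone_logits = self.pron(chars)                    # pronunciation prediction
        phones = phone_logits.argmax(-1)                   # hard choice for illustration
        x = self.char_emb(chars) + self.phone_emb(phones)  # fuse both views of the input
        return self.encoder(x), phone_logits               # hidden states + logits for CE loss

model = EncoderWithPronunciation()
chars = torch.randint(0, 80, (2, 17))
hidden, logits = model(chars)
print(hidden.shape, logits.shape)  # torch.Size([2, 17, 256]) torch.Size([2, 17, 70])
```

Under the two-stage scheme described in the abstract, such a predictor would first be trained alone with a cross-entropy loss against reference phoneme sequences on a large text corpus, and its weights would then be loaded before training the full TTS model; the hard argmax fusion above is only one plausible choice, and a soft mixture over phoneme embeddings would be another.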


References

  • Achanta, S., Pandey, A., & Gangashetty, S. V. (2016). Analysis of sequence to sequence neural networks on grapheme to phoneme conversion task. In Proceedings of international joint conference on neural networks (IJCNN) (pp. 2798–2804).

  • Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceedings of international conference on machine learning (ICML) (pp. 195–204).

  • Beliaev, S., Rebryk, Y., & Ginsburg, B. (2020). TalkNet: Fully-convolutional non-autoregressive speech synthesis model. arXiv preprint arXiv:2005.05514

  • Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems (pp. 2962–2970).

  • Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236–243.

  • Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 1996) (pp. 373–376).

  • Kalchbrenner, N., Elsen, E., Simonyan K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., & Kavukcuoglu, K. (2018). Efficient neural audio synthesis. In Proceedings of international conference on machine learning (ICML) (pp. 2415–2424).

  • Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of international conference on learning representations (ICLR).

  • Kumar, K., Kumar, R., Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., Brebisson, A., Bengio, Y., & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in neural information processing systems (pp. 14881–14892).

  • Łancucki, A. (2020). FastPitch: Parallel text-to-speech with pitch prediction. arXiv preprint arXiv:2006.06873

  • Li, J., Wu, Z., Li, R., Zhi, P., Yang, S., & Meng, H. (2019a). Knowledge-based linguistic encoding for end-to-end mandarin text-to-speech synthesis. In Proceedings of Interspeech (pp. 4494–4498).

  • Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019b). Neural speech synthesis with transformer network. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 6706–6713).

  • Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flow-TTS: A non-autoregressive network for text to speech based on flow. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 7209–7213).

  • Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877–1884.

  • Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (pp. 125–125).

  • Peng, K., Ping, W., Song, Z., & Zhao, K. (2020). Non-autoregressive neural text-to-speech. In Proceedings of international conference on machine learning (ICML) (pp. 10192–10204).

  • Ping, W., Peng, K., & Chen, J. (2019). ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proceedings of international conference on learning representations (ICLR).

  • Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2018). Deep voice 3: 2000-speaker neural text-to-speech. In Proceedings of international conference on learning representations (ICLR).

  • Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A flow-based generative network for speech synthesis. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2019) (pp. 3617–3621).

  • Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2020). FastSpeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558

  • Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2019). FastSpeech: Fast, robust and controllable text to speech. In Advances in neural information processing systems (pp. 3165–3174).

  • Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2018) (pp. 4779–4783).

  • Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., & Bengio, Y. (2017). Char2wav: End-to-end speech synthesis. In Proceedings of international conference on learning representations (ICLR) (pp. 24–26).

  • Taylor, J., & Richmond, K. (2019). Analysis of pronunciation learning in end-to-end speech synthesis. In Proceedings of Interspeech (pp. 2070–2074).

  • Valle, R., Shih, K. J., Prenger, R., & Catanzaro, B. (2021). Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. In Proceedings of international conference on learning representations (ICLR).

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

  • Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech (pp. 4006–4010).

  • Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech. arXiv preprint arXiv:2005.05106

  • Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech (pp. 3330–3334).

  • Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2013) (pp. 7962–7966).

  • Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.

  • Zeng, Z., Wang, J., Cheng, N., Xia, T., & Xiao, J. (2020). AlignTTS: Efficient feed-forward text-to-speech system without explicit alignment. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 6714–6718).

Acknowledgements

We appreciate the helpful discussions with Dr. Kim and Dr. Ri, and we thank the anonymous reviewers and editors for their many invaluable comments and suggestions that improved this paper.

Author information

Corresponding author

Correspondence to Chol-Jin Han.

Ethics declarations

Compliance with ethical standards

The authors have no items to declare under the guideline.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Han, CJ., Ri, UC., Mun, SI. et al. An end-to-end TTS model with pronunciation predictor. Int J Speech Technol 25, 1013–1024 (2022). https://doi.org/10.1007/s10772-022-10008-7
