Abstract
Recent end-to-end TTS models generate human-like natural speech in real time, but they produce pronunciation errors that degrade the naturalness of the synthesized speech. In this paper, we investigate a method to alleviate the mispronunciation problem, one of the challenges in end-to-end TTS. To address this problem, we propose a novel framework that incorporates a pronunciation predictor, which predicts the phoneme sequence corresponding to a given character sequence, into the encoder of the end-to-end TTS model. Our model is based on a non-autoregressive feed-forward Transformer, which generates the mel-spectrogram in parallel, and the pronunciation predictor also has a feed-forward architecture. Motivated by the idea that the pronunciation errors of an end-to-end model are caused by the limited and unbalanced lexical coverage of the training data, we also propose a two-stage training scheme in which the pronunciation predictor is pre-trained on a large-scale language dataset. Experimental results showed that our model outperforms FastSpeech in the naturalness assessment, and the phoneme error rate dropped from 8.7 to 1.4%. From the experimental results, we also found that using the pronunciation information is effective for duration prediction.
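The paper does not include an implementation, but the idea of a feed-forward, non-autoregressive pronunciation predictor can be sketched roughly as follows. This is a minimal NumPy illustration only: the vocabulary sizes, dimensions, layer count, and random weights are all hypothetical stand-ins (in the proposed framework the predictor would be pre-trained on a large-scale language dataset and its output fed into the TTS encoder).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabularies and dimensions (not from the paper).
CHAR_VOCAB, PHONE_VOCAB, EMB_DIM, HID_DIM = 30, 45, 16, 32

# Randomly initialized parameters stand in for the pre-trained predictor.
W_emb = rng.normal(0.0, 0.1, (CHAR_VOCAB, EMB_DIM))
W1 = rng.normal(0.0, 0.1, (EMB_DIM, HID_DIM))
W2 = rng.normal(0.0, 0.1, (HID_DIM, PHONE_VOCAB))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_phonemes(char_ids):
    """Map a character-ID sequence to per-position phoneme probabilities.

    Every position is computed in parallel (no autoregression),
    mirroring the parallel generation of the feed-forward
    Transformer backbone described in the abstract.
    """
    h = np.maximum(W_emb[char_ids] @ W1, 0.0)  # ReLU hidden layer
    return softmax(h @ W2)                      # shape: (seq_len, PHONE_VOCAB)

probs = predict_phonemes(np.array([3, 7, 1, 12]))
phoneme_ids = probs.argmax(axis=-1)  # one predicted phoneme per character
```

In the framework the abstract describes, these predicted pronunciations (or the predictor's hidden representations) would be consumed by the encoder of the TTS model; the sketch above only shows the input/output shape of such a predictor.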
References
Achanta, S., Pandey, A., & Gangashetty, S. V. (2016). Analysis of sequence to sequence neural networks on grapheme to phoneme conversion task. In Proceedings of international joint conference on neural networks (IJCNN) (pp. 2798–2804).
Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceeedings of international conference on machine learning (ICML) (pp. 195–204).
Beliaev, S., Rebryk, Y., & Ginsburg, B. (2020). TalkNet: Fully-convolutional non-autoregressive speech synthesis model. arXiv preprint arXiv:2005.05514
Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems (pp. 2962–2970).
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236–243.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 1996) (pp. 373–376).
Kalchbrenner, N., Elsen, E., Simonyan K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., & Kavukcuoglu, K. (2018). Efficient neural audio synthesis. In Proceedings of international conference on machine learning (ICML) (pp. 2415–2424).
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of international conference on learning representations (ICLR).
Kumar, K., Kumar, R., Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., Brebisson, A., Bengio, Y., & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in neural information processing systems (pp. 14881–14892).
Łancucki, A. (2020). Fastpitch: Parallel text-to-speech with pitch prediction. arXiv preprint arXiv:2006.06873
Li, J., Wu, Z., Li, R., Zhi, P., Yang, S., & Meng, H. (2019a). Knowledge-based linguistic encoding for end-to-end mandarin text-to-speech synthesis. In Proceedings of Interspeech (pp. 4494–4498).
Li, N., Liu, S., Liu, Y., Zhao, S., & Liu, M. (2019b). Neural speech synthesis with transformer network. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, (pp. 6706–6713).
Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flowtts: A non-autoregressive network for text to speech based on flow. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 7209–7213).
Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877–1884.
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (pp. 125–125).
Peng, K., Ping, W., Song, Z., & Zhao, K. (2020). Non-autoregressive neural text-to-speech. In Proceedings of international conference on machine learning (ICML) (pp. 10192–10204).
Ping, W., Peng, K., & Chen, J. (2019). ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proceedings of international conference on learning representations (ICLR).
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2018). Deep voice 3: 2000-speaker neural text-to-speech. In Proceedings of international conference on learning representations (ICLR).
Prenger, R., Valle, R., & Catanzaro, B. (2019). Waveglow: A flow-based generative network for speech synthesis. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2019) (pp. 3617–3621).
Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2020). Fastspeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. (2019). Fastspeech: Fast, robust and controllable text to speech. In Advances in neural information processing systems (pp. 3165–3174).
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2018) (pp. 4779–4783).
Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., & Bengio, Y. (2017). Char2wav: End-to-end speech synthesis. In Proceedings of international conference on learning representations (ICLR) (pp. 24–26).
Taylor, J., & Richmond K. (2019). Analysis of pronunciation learning in end-to-end speech synthesis. In Proceedings of Interspeech (pp. 2070–2074).
Valle, R., Shih, K. J., Prenger, R., & Catanzaro, B. (2021). Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. In Proceedings of international conference on learning representations (ICLR).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech (pp. 4006–4010).
Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech. arXiv preprint, arXiv:2005.05106.
Yao, K., & Zweig, G. (2015). Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech (pp. 3330–3334).
Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2013) (pp. 7962–7966).
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Zeng, Z., Wang, J., Cheng, N., Xia, T., & Xiao, J. (2020). Aligntts: Efficient feed-forward text-to-speech system without explicit alignment. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 6714–6718).
Acknowledgements
We appreciate the helpful discussions with Dr. Kim and Dr. Ri, and we thank the anonymous reviewers and editors for their many invaluable comments and suggestions that improved this paper.
Ethics declarations
Compliance with ethical standards
The authors have no items to declare under the guidelines.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Han, CJ., Ri, UC., Mun, SI. et al. An end-to-end TTS model with pronunciation predictor. Int J Speech Technol 25, 1013–1024 (2022). https://doi.org/10.1007/s10772-022-10008-7