Abstract
Speech synthesis, an artificial intelligence technology that uses computers to imitate human speech, has played a crucial role in human–computer interaction because it can automatically convert text into speech with satisfactory intelligibility and naturalness. Tacotron2 is the second-generation end-to-end English speech synthesis model developed by Google. As Mandarin becomes increasingly popular worldwide, the associated speech synthesis technologies have been applied in a variety of applications. To extend Tacotron2 to Mandarin, we propose a novel synthesis method that adds a Mandarin-to-Pinyin module and a prosodic structure prediction model to Tacotron2. Subjective and objective evaluations of the synthesized results demonstrate that the added prosodic structure prediction model helps Tacotron2 synthesize more natural and human-like Mandarin speech.
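
To illustrate the front-end pipeline summarized above, the following minimal sketch (not the authors' implementation) shows how a Mandarin sentence might be converted to tone-numbered Pinyin and interleaved with prosodic-boundary tokens before being fed to a Tacotron2-style encoder. It assumes the open-source pypinyin package; predict_prosody_boundaries is a hypothetical stand-in for the paper's prosodic structure prediction model.

# Illustrative sketch only: Mandarin text -> tone-numbered Pinyin with
# prosodic-boundary tokens, as input symbols for a Tacotron2-style encoder.
from pypinyin import lazy_pinyin, Style


def predict_prosody_boundaries(text: str) -> dict:
    """Hypothetical predictor mapping syllable indices to boundary labels,
    e.g. '#1' (prosodic word), '#2' (prosodic phrase), '#3' (intonational
    phrase). A real system would use a trained sequence-labelling model."""
    return {1: "#1", 3: "#2"}  # toy output for the example sentence below


def text_to_symbols(text: str) -> list:
    """Build the symbol sequence: Pinyin syllables with tone numbers,
    interleaved with the predicted prosodic-boundary tokens."""
    syllables = lazy_pinyin(text, style=Style.TONE3)  # e.g. ['jin1', 'tian1', ...]
    boundaries = predict_prosody_boundaries(text)
    symbols = []
    for i, syllable in enumerate(syllables):
        symbols.append(syllable)
        if i in boundaries:
            symbols.append(boundaries[i])
    return symbols


if __name__ == "__main__":
    print(text_to_symbols("今天天气很好"))
    # ['jin1', 'tian1', '#1', 'tian1', 'qi4', '#2', 'hen3', 'hao3']

In this sketch the boundary tokens are simply appended to the symbol inventory, so the Tacotron2 encoder can attend to prosodic structure alongside phonetic content; the actual label scheme and predictor used in the paper are not specified in the abstract.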
Acknowledgements
This work was supported by the high-performance computing platform of Xi'an Jiaotong University, which provided convenient and efficient computing resources. This research was also supported by the National Key Research and Development Program of China (No. 2018AAA0102201) and the National Natural Science Foundation of China (No. 61877049).
Cite this article
Liu, J., Xie, Z., Zhang, C. et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2. Int. J. Mach. Learn. & Cyber. 12, 2809–2823 (2021). https://doi.org/10.1007/s13042-021-01365-x