ABSTRACT
End-to-end speech synthesis breaks away from the traditional pipeline framework and converts text directly into speech. Although Tacotron2 has been shown to surpass traditional pipeline systems in speech naturalness, it still has several defects. This paper addresses one such flaw, which degrades both the quality and the achievable length of synthesized speech: the cumulative error between the training process (teacher-forced forward pass) and the synthesis process (autoregressive inference). To mitigate this problem, we propose an unsupervised GAN (Generative Adversarial Network) model based on Tacotron2. Because a prosody discriminator is included in its design, the proposed model also improves the prosody of the synthesized speech. To further reduce the cumulative error, we propose a training strategy for Tacotron2 called "random down". We then show that unimportant attention weights can contribute to the cumulative error when the input sequence is long, and we therefore apply a window to the attention weights. With these methods, the maximum synthesis length is extended to about 1000 encoder outputs, and the prosody of the synthesized speech is also improved.
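The attention-windowing idea mentioned above can be illustrated with a minimal sketch: attention scores outside a fixed window around the current alignment position are masked out before the softmax, so that spurious small weights on a long input sequence cannot accumulate. The function name, the `width` hyperparameter, and the choice of a hard mask are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def windowed_attention(scores, center, width=10):
    """Apply a hard window to attention scores around `center`.

    Scores outside [center - width, center + width] are set to -inf,
    so after the softmax their attention weights are exactly zero.
    `width` is a hypothetical hyperparameter chosen for illustration.
    """
    mask = np.full_like(scores, -np.inf)
    lo = max(0, center - width)
    hi = min(len(scores), center + width + 1)
    mask[lo:hi] = 0.0          # keep scores inside the window unchanged
    masked = scores + mask
    # Numerically stable softmax; exp(-inf) evaluates to 0 outside the window.
    e = np.exp(masked - masked[lo:hi].max())
    return e / e.sum()

# A long encoder output sequence (~1000 steps, as in the abstract).
weights = windowed_attention(np.random.randn(1000), center=500, width=10)
```

In this sketch only the 21 positions inside the window receive nonzero weight, regardless of the total sequence length, which is the mechanism by which windowing limits error accumulation on long inputs.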
A New End-to-End Long-Time Speech Synthesis System Based on Tacotron2