DOI: 10.1145/3364908.3365292

A New End-to-End Long-Time Speech Synthesis System Based on Tacotron2

Published: 20 September 2019

ABSTRACT

End-to-end speech synthesis breaks away from the traditional pipeline framework and converts text directly into speech. Although Tacotron2 has been shown to surpass traditional pipeline systems in speech naturalness, it still has several defects. This paper addresses one such flaw, which degrades both the quality and the attainable length of synthesized speech: the cumulative error between the training process (teacher-forced forward pass) and the synthesis process (inference). To mitigate this problem, an unsupervised GAN (Generative Adversarial Network) model is proposed on top of Tacotron2. Because a prosody discriminator is built into the model, the proposed GAN can also improve the prosody of the synthesized speech. To further reduce the cumulative error, this paper proposes a training strategy for Tacotron2 called "random down". It then demonstrates that unimportant attention weights can contribute to cumulative error when the input sequence is very long; to address this, a window is applied to the attention weights. With these methods, the maximum synthesis length is extended to about 1000 encoder outputs, and the prosody of the synthetic speech is also improved.
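The attention window mentioned in the abstract can be sketched as follows. The abstract gives no details, so the window half-width and the use of the previous alignment position as the window centre are illustrative assumptions, not values from the paper:

```python
import numpy as np

def windowed_attention(scores, prev_focus, half_width=2):
    """Softmax over attention scores restricted to a window around the
    previous alignment position.  Weights outside the window are forced
    to exactly zero, so small "unimportant" weights far from the current
    position cannot accumulate error on long input sequences.
    `half_width` and the window centre are illustrative assumptions."""
    masked = np.full_like(scores, -np.inf)
    lo = max(0, prev_focus - half_width)
    hi = min(len(scores), prev_focus + half_width + 1)
    masked[lo:hi] = scores[lo:hi]
    # exp(-inf) evaluates to 0, so out-of-window weights vanish.
    weights = np.exp(masked - np.max(masked))
    return weights / weights.sum()
```

For example, with uniform scores over 10 encoder outputs and the previous focus at position 5, only positions 3 through 7 receive non-zero weight.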
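The abstract does not specify how the "random down" strategy works, but the cumulative (exposure-bias) error it targets is commonly reduced by occasionally feeding the decoder its own previous prediction instead of the ground-truth frame during training. The sketch below shows that standard scheduled-sampling idea as one plausible interpretation, not necessarily the paper's exact method:

```python
import random

def decoder_input(ground_truth_frame, predicted_frame, teacher_forcing_prob):
    """Pick the next decoder input during training.  With probability
    `teacher_forcing_prob` feed the ground-truth frame (ordinary teacher
    forcing); otherwise feed the model's own previous prediction, so the
    decoder also sees inference-like inputs and the train/inference
    mismatch shrinks.  This interpretation of "random down" is an
    assumption; the abstract does not define the strategy."""
    if random.random() < teacher_forcing_prob:
        return ground_truth_frame
    return predicted_frame
```

Annealing `teacher_forcing_prob` downward over training would expose the decoder to progressively more of its own predictions.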
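The GAN component with its prosody discriminator is likewise not detailed in the abstract. A common choice for stable adversarial training of spectrogram generators is the least-squares GAN objective; the sketch below assumes that formulation, and the function names and targets are illustrative:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares GAN discriminator loss: push scores for real
    prosody toward 1 and scores for synthesized prosody toward 0."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator loss: make the discriminator score synthesized
    prosody as real (toward 1)."""
    return 0.5 * np.mean((np.asarray(d_fake) - 1.0) ** 2)
```

A perfect discriminator (real scored 1, fake scored 0) yields zero discriminator loss, while the generator loss then penalizes the generator until its output fools the discriminator.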

Published in

SSPS '19: Proceedings of the 2019 International Symposium on Signal Processing Systems
September 2019, 188 pages
ISBN: 9781450362412
DOI: 10.1145/3364908

      Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


      Qualifiers

      • research-article
      • Research
      • Refereed limited
