
A New End-to-End Long-Time Speech Synthesis System Based on Tacotron2

Published: 20 September 2019

Abstract

End-to-end speech synthesis breaks away from the traditional pipeline framework and converts text directly into speech. Although Tacotron2 has been shown to outperform traditional pipeline systems in speech naturalness, it still has a number of defects. This paper examines one such flaw, which degrades both the quality and the achievable length of synthesized speech: the cumulative error between the training process (forward) and the synthesis process (inference). To mitigate this problem, an unsupervised GAN (Generative Adversarial Network) model is proposed on top of Tacotron2. Because a prosody discriminator is built into the model, the proposed GAN can also optimize the prosody of the synthesized speech. To further reduce the cumulative error, a training strategy called "random down" is proposed for Tacotron2. The paper then demonstrates that unimportant attention weights can contribute to the cumulative error when the input sequence is too long, and a window is therefore applied to the attention weights. With these methods, the length of synthesized speech is extended to about 1000 encoder outputs, and the prosody of the synthesized speech is also improved.
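The abstract describes the attention-weight window and the GAN prosody discriminator only at a high level, so the two sketches below are illustrative reconstructions rather than the paper's implementation; every function name, layer size, and the window half-width is an assumption. The first sketch shows one common way to window attention weights over a long encoder sequence: weights far from the previously attended position are zeroed out and the rest are renormalized, so that spurious "unimportant" weights cannot accumulate as the utterance grows.

```python
import numpy as np

def window_attention(weights, prev_pos, half_width=20):
    """Zero attention weights outside a window around the previously
    attended encoder position, then renormalize.

    weights    : (T_enc,) attention weights over the encoder outputs
    prev_pos   : encoder index with the largest weight at the last decoder step
    half_width : hypothetical window half-width; the paper does not state a value
    """
    lo = max(0, prev_pos - half_width)
    hi = min(len(weights), prev_pos + half_width + 1)
    masked = np.zeros_like(weights)
    masked[lo:hi] = weights[lo:hi]
    total = masked.sum()
    return masked / total if total > 0 else weights  # fall back if the window is empty

# Toy usage: 1000 encoder outputs, alignment currently near position 400.
att = np.random.rand(1000)
att /= att.sum()
windowed = window_attention(att, prev_pos=400)
print(windowed.argmax(), round(float(windowed.sum()), 6))
```

The second sketch is a minimal mel-spectrogram discriminator trained with a least-squares GAN objective, shown here as one plausible form of the unsupervised prosody discriminator mentioned in the abstract; the convolutional architecture and the loss choice are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class ProsodyDiscriminator(nn.Module):
    """Hypothetical discriminator that scores mel spectrograms for 'realness'."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        return self.net(mel).mean(dim=(1, 2))    # one score per utterance

def lsgan_d_loss(disc, real_mel, fake_mel):
    # Least-squares GAN objective: push real scores toward 1, fake toward 0.
    return ((disc(real_mel) - 1) ** 2).mean() + (disc(fake_mel) ** 2).mean()

def lsgan_g_loss(disc, fake_mel):
    # The generator (here, the Tacotron2 decoder output) is rewarded when fakes score near 1.
    return ((disc(fake_mel) - 1) ** 2).mean()

# Toy usage with random tensors standing in for ground-truth and synthesized mels.
real = torch.randn(4, 80, 200)
fake = torch.randn(4, 80, 200)
disc = ProsodyDiscriminator()
print(lsgan_d_loss(disc, real, fake).item(), lsgan_g_loss(disc, fake).item())
```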



    Published In

    SSPS '19: Proceedings of the 2019 International Symposium on Signal Processing Systems
    September 2019
    188 pages
    ISBN:9781450362412
    DOI:10.1145/3364908

    In-Cooperation

    • Beijing University of Posts and Telecommunications

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. attention weights window
    2. end-to-end TTS synthesis
    3. random down method
    4. unsupervised GAN model

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SSPS 2019

