ABSTRACT
End-to-end speech synthesis breaks away from the traditional pipeline framework and converts text directly into speech. Although Tacotron2 has been shown to surpass traditional pipeline systems in speech naturalness, it still has several defects. This paper addresses one such flaw, which degrades both the quality and the achievable length of synthesized speech: the cumulative error between the training process (teacher-forced forward pass) and the synthesis process (autoregressive inference). To mitigate this problem, we propose an unsupervised GAN (Generative Adversarial Network) model based on Tacotron2. Because a prosody discriminator is included in its design, the proposed model also improves the prosody of the synthesized speech. To further reduce the cumulative error, we propose a training strategy for Tacotron2 called "random down". We then show that unimportant attention weights can contribute to the cumulative error when the input sequence is long, and we therefore apply a window to the attention weights. With these methods, the maximum synthesis length is extended to about 1000 encoder outputs, and the prosody of the synthesized speech is also improved.
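The attention-windowing idea mentioned above can be illustrated with a minimal sketch: attention scores outside a fixed window around the current alignment position are masked out before the softmax, so that spurious small weights on a long input sequence cannot accumulate. The function name, the `width` hyperparameter, and the choice of a hard mask are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def windowed_attention(scores, center, width=10):
    """Apply a hard window to attention scores around `center`.

    Scores outside [center - width, center + width] are set to -inf,
    so after the softmax their attention weights are exactly zero.
    `width` is a hypothetical hyperparameter chosen for illustration.
    """
    mask = np.full_like(scores, -np.inf)
    lo = max(0, center - width)
    hi = min(len(scores), center + width + 1)
    mask[lo:hi] = 0.0          # keep scores inside the window unchanged
    masked = scores + mask
    # Numerically stable softmax; exp(-inf) evaluates to 0 outside the window.
    e = np.exp(masked - masked[lo:hi].max())
    return e / e.sum()

# A long encoder output sequence (~1000 steps, as in the abstract).
weights = windowed_attention(np.random.randn(1000), center=500, width=10)
```

In this sketch only the 21 positions inside the window receive nonzero weight, regardless of the total sequence length, which is the mechanism by which windowing limits error accumulation on long inputs.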
A New End-to-End Long-Time Speech Synthesis System Based on Tacotron2