DOI: 10.1145/3548608.3559304

CRCTTS: Convolution-Recurrent-Convolution Text-to-Speech System

Published: 14 October 2022

Abstract

End-to-end speech synthesis technology has largely replaced Statistical Parametric Speech Synthesis (SPSS) in the text-to-speech (TTS) field. End-to-end models based on neural networks require little domain knowledge yet synthesize more natural speech. Tacotron was the first model to synthesize speech that humans find hard to distinguish from real recordings. We propose a new end-to-end speech synthesis system called Convolution-Recurrent-Convolution Text-to-Speech (CRCTTS). We chose Tacotron as our baseline model and adjusted its architecture with a fully Convolutional Neural Network (CNN) module and Dynamic Convolution Attention (DCA). In addition, we introduce a guided attention mechanism to accelerate attention alignment in the decoder module. With these techniques, the proposed model is shown to synthesize higher-quality speech while requiring less time in both the training and synthesis stages than the baseline model.
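The abstract does not spell out the guided attention mechanism it uses. One common formulation, following Tachibana et al.'s guided attention loss (which penalizes attention mass far from the roughly linear text-to-frame alignment expected in TTS), can be sketched as below; the function names and the width parameter `g=0.2` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def guided_attention_weight(n_text, n_mel, g=0.2):
    """Penalty matrix W[n, t]: near zero on the diagonal band where a
    monotonic text-to-frame alignment is expected, approaching 1 far from it."""
    n = np.arange(n_text)[:, None] / n_text   # normalized text positions
    t = np.arange(n_mel)[None, :] / n_mel     # normalized mel-frame positions
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(attention, g=0.2):
    """Mean elementwise product of the attention matrix (text x frames)
    and the penalty matrix; added to the main loss during training."""
    W = guided_attention_weight(*attention.shape, g=g)
    return float(np.mean(attention * W))
```

An attention matrix concentrated along the diagonal incurs a loss near zero, while an off-diagonal alignment is penalized, which pushes the decoder toward a monotonic alignment early in training.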

    Published In

    ICCIR '22: Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics
    June 2022
    905 pages
    ISBN:9781450397179
    DOI:10.1145/3548608

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICCIR 2022

    Acceptance Rates

    Overall Acceptance Rate 131 of 239 submissions, 55%
