DOI: 10.1145/3548608.3559304

CRCTTS: Convolution-Recurrent-Convolution Text-to-Speech System

Published: 14 October 2022

Abstract

End-to-end speech synthesis technology has largely replaced Statistical Parametric Speech Synthesis (SPSS) in the text-to-speech (TTS) field. End-to-end models based on neural networks require little domain knowledge yet synthesize more natural speech. Tacotron was the first model to synthesize speech that humans find hard to distinguish from real recordings. We propose a new end-to-end speech synthesis system called Convolution-Recurrent-Convolution Text-to-Speech (CRCTTS). We chose Tacotron as our baseline model and adjusted its architecture with a fully Convolutional Neural Network (CNN) module and Dynamic Convolution Attention (DCA). In addition, we introduce a guided attention mechanism to accelerate attention alignment in the decoder module. With these techniques, the proposed model is shown to synthesize higher-quality speech while requiring less time in both the training and synthesis stages than the baseline model.
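The abstract does not spell out the guided attention mechanism it uses. One common formulation, following Tachibana et al.'s guided attention loss (which penalizes attention mass far from the roughly linear text-to-frame alignment expected in TTS), can be sketched as below; the function names and the width parameter `g=0.2` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def guided_attention_weight(n_text, n_mel, g=0.2):
    """Penalty matrix W[n, t]: near zero on the diagonal band where a
    monotonic text-to-frame alignment is expected, approaching 1 far from it."""
    n = np.arange(n_text)[:, None] / n_text   # normalized text positions
    t = np.arange(n_mel)[None, :] / n_mel     # normalized mel-frame positions
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(attention, g=0.2):
    """Mean elementwise product of the attention matrix (text x frames)
    and the penalty matrix; added to the main loss during training."""
    W = guided_attention_weight(*attention.shape, g=g)
    return float(np.mean(attention * W))
```

An attention matrix concentrated along the diagonal incurs a loss near zero, while an off-diagonal alignment is penalized, which pushes the decoder toward a monotonic alignment early in training.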

    Published In

    ICCIR '22: Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics
    June 2022
    905 pages
    ISBN:9781450397179
    DOI:10.1145/3548608

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICCIR 2022

    Acceptance Rates

    Overall Acceptance Rate 131 of 239 submissions, 55%
