DOI: 10.1145/3614008.3614019
Research Article

The Tacotron2-based IPA-to-Speech speech synthesis system

Published: 17 October 2023

ABSTRACT

To help language learners better understand the pronunciation of a language, in this paper we propose an IPA-to-Speech synthesis system that aims to generate high-quality human speech from written language in IPA format. The system has two main parts: a Transformer-based G2P converter and a Tacotron2-based speech synthesis module. The G2P converter is used to build the training data, converting all English sentences in LJSpeech into their IPA representations, while the speech synthesis module generates speech from the IPA sentences. Word error rate and phoneme error rate were used to evaluate the G2P converter, and the mean opinion score was used to evaluate the synthesized speech. This work also suggests representing dialects in IPA format; in future work, we will extend this research to dialect recognition and generation.


Published in

SPML '23: Proceedings of the 2023 6th International Conference on Signal Processing and Machine Learning
July 2023, 383 pages
ISBN: 9798400707575
DOI: 10.1145/3614008
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
