ABSTRACT
To help language learners better understand the pronunciation of a language, this paper proposes an IPA-to-Speech synthesis system that generates high-quality human speech from text written in the International Phonetic Alphabet (IPA). The system consists of two main parts: a Transformer-based G2P (grapheme-to-phoneme) converter and a Tacotron2-based speech synthesizer. The G2P converter is used to build the training data: every English sentence in LJSpeech is converted into its IPA form, and the synthesis module then generates speech from these IPA sentences. Word error rate and phoneme error rate were used to evaluate the G2P converter, and mean opinion score was used to evaluate the quality of the synthesized speech. This work also motivates using IPA to represent dialects; in future work, we will extend this research to dialect recognition and generation.
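The word and phoneme error rates mentioned above are both edit-distance-based metrics; as a minimal sketch (not the authors' actual evaluation code), the same function computes WER over word tokens and PER over phoneme tokens, shown here on a hypothetical IPA sequence for the word "speech":

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] holds the distance for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def error_rate(ref_tokens, hyp_tokens):
    """WER when tokens are words, PER when tokens are phonemes."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

# Reference vs. predicted IPA phonemes for "speech" (illustrative example)
ref = ["s", "p", "i", "t", "ʃ"]
hyp = ["s", "p", "i", "ʃ"]        # one phoneme missing
print(error_rate(ref, hyp))       # 1 edit / 5 phonemes = 0.2
```

In practice, PER is averaged over the whole test set by summing edit distances and dividing by the total number of reference phonemes, rather than averaging per-utterance rates.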
Index Terms
- The Tacotron2-based IPA-to-Speech speech synthesis system