Abstract
Deep neural networks (DNNs) are attracting increasing interest in speech processing applications, especially in text-to-speech synthesis. Indeed, state-of-the-art speech generation tools such as MERLIN and WAVENET are entirely DNN-based. However, each language has to be modeled separately. One of the key components of a speech synthesis system is the module that generates prosodic parameters from contextual input features, in particular the fundamental frequency (\(F_{0}\)). Since \(F_{0}\) conveys intonation, it must be accurately modeled to produce intelligible and natural speech. However, \(F_{0}\) modeling is highly language-dependent, so language-specific characteristics have to be taken into account. In this paper, we model \(F_{0}\) for Arabic speech synthesis with feedforward and recurrent DNNs, using features specific to Arabic such as vowel quantity and gemination, in order to improve the quality of Arabic parametric speech synthesis.
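To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of a feedforward network mapping contextual input features to a per-frame \(F_{0}\) value. The feature layout, including the Arabic-specific vowel-quantity and gemination flags, is an illustrative assumption; a real system would use the full contextual label set and a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Rectified linear activation, a common hidden-layer nonlinearity."""
    return np.maximum(0.0, x)

class FeedforwardF0Model:
    """Tiny MLP sketch: contextual feature vector -> predicted F0 value.

    Weights are random here for illustration; in practice they would be
    learned by minimizing prediction error on voiced frames.
    """
    def __init__(self, n_in, n_hidden=32):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, x):
        h = relu(x @ self.W1 + self.b1)   # hidden layer
        return (h @ self.W2 + self.b2).ravel()  # one F0 value per row

# Hypothetical context vectors, one row per frame:
# [is_vowel, vowel_quantity (0=short, 1=long), gemination (0/1),
#  relative position in syllable]
features = np.array([[1.0, 1.0, 0.0, 0.5],
                     [0.0, 0.0, 1.0, 0.2]])

model = FeedforwardF0Model(n_in=features.shape[1])
f0 = model.predict(features)
print(f0.shape)  # → (2,): one predicted value per input frame
```

A recurrent variant would replace the hidden layer with an LSTM over the frame sequence, letting the prediction at each frame depend on its temporal context rather than on the current feature vector alone.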
Acknowledgement
This research work was conducted in the framework of PHC-Utique Program, financed by CMCU (Comité mixte de coopération universitaire), grant No15G1405.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zangar, I., Mnasri, Z., Colotte, V., Jouvet, D. (2020). \(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_20
DOI: https://doi.org/10.1007/978-3-030-16841-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16840-7
Online ISBN: 978-3-030-16841-4
eBook Packages: Intelligent Technologies and Robotics (R0)