
\(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis

  • Conference paper
  • In: Recent Advances in Big Data and Deep Learning (INNSBDDL 2019)

Part of the book series: Proceedings of the International Neural Networks Society (INNS, volume 1)

Abstract

Deep neural networks (DNN) are attracting increasing interest in speech processing applications, especially in text-to-speech synthesis. Indeed, state-of-the-art speech generation tools such as MERLIN and WAVENET are entirely DNN-based. However, each language has to be modeled in its own right. A key component of a speech synthesis system is the module that generates prosodic parameters from contextual input features, and in particular the fundamental frequency (\(F_{0}\)). Since \(F_{0}\) carries intonation, it must be modeled accurately to produce intelligible and natural speech. Moreover, \(F_{0}\) modeling is highly language-dependent, so language-specific characteristics have to be taken into account. In this paper, we model \(F_{0}\) for Arabic speech synthesis with feedforward and recurrent DNN, using Arabic-specific features such as vowel quantity and gemination, in order to improve the quality of Arabic parametric speech synthesis.
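As an illustration only, and not the authors' implementation, the sketch below shows what the two model families named in the abstract could look like: a frame-wise feedforward regressor and a bidirectional LSTM mapping contextual linguistic features (including hypothetical Arabic-specific flags for vowel quantity and gemination) to log \(F_{0}\) plus a voicing decision. The use of PyTorch, the feature dimensionality, the layer sizes, and the two-dimensional output are assumptions made for the example.

```python
# Hypothetical sketch (not the paper's code): feedforward vs. recurrent
# F0 regression from frame-level contextual features. Dimensions and
# framework (PyTorch) are assumptions for illustration only.
import torch
import torch.nn as nn

FEAT_DIM = 300  # assumed size of the contextual feature vector per frame
                # (phone identity, positional features, and Arabic-specific
                # flags such as vowel quantity and gemination)

class FeedForwardF0(nn.Module):
    """Frame-wise feedforward regressor: features -> (log F0, voicing)."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=512, layers=4):
        super().__init__()
        blocks, dim = [], feat_dim
        for _ in range(layers):
            blocks += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, 2)   # [log F0, voiced/unvoiced score]

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        return self.head(self.body(x))

class BLSTMF0(nn.Module):
    """Sequence model: bidirectional LSTM over the frame sequence."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256, layers=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, x):
        out, _ = self.rnn(x)               # out: (batch, frames, 2*hidden)
        return self.head(out)

if __name__ == "__main__":
    x = torch.randn(8, 200, FEAT_DIM)      # 8 utterances, 200 frames each
    target = torch.randn(8, 200, 2)        # reference log F0 + voicing
    model = BLSTMF0()
    loss = nn.MSELoss()(model(x), target)  # a typical regression objective
    loss.backward()
    print(loss.item())
```

In practice, such models are trained on frame-aligned linguistic/acoustic pairs; the feedforward variant treats frames independently, while the bidirectional LSTM exploits the surrounding context of the whole utterance.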



Acknowledgement

This research work was conducted within the framework of the PHC-Utique Program, financed by CMCU (Comité mixte de coopération universitaire), grant No. 15G1405.

Author information

Corresponding author

Correspondence to Imene Zangar.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Zangar, I., Mnasri, Z., Colotte, V., Jouvet, D. (2020). \(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_20
