
\(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis

  • Conference paper
  • In: Recent Advances in Big Data and Deep Learning (INNSBDDL 2019)

Part of the book series: Proceedings of the International Neural Networks Society (INNS, volume 1)

Abstract

Deep neural networks (DNN) are attracting increasing interest in speech processing applications, especially in text-to-speech synthesis. Indeed, state-of-the-art speech generation tools such as MERLIN and WAVENET are entirely DNN-based. However, each language has to be modeled in its own right. A key component of a speech synthesis system is the module that generates prosodic parameters from contextual input features, and in particular the fundamental frequency (\(F_{0}\)). Since \(F_{0}\) carries intonation, it must be modeled accurately to produce intelligible and natural speech. Moreover, \(F_{0}\) modeling is highly language-dependent, so language-specific characteristics have to be taken into account. In this paper, we model \(F_{0}\) for Arabic speech synthesis with feedforward and recurrent DNN, using Arabic-specific features such as vowel quantity and gemination, in order to improve the quality of Arabic parametric speech synthesis.
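As an illustration only, and not the authors' implementation, the sketch below shows what the two model families named in the abstract could look like: a frame-wise feedforward regressor and a bidirectional LSTM mapping contextual linguistic features (including hypothetical Arabic-specific flags for vowel quantity and gemination) to log \(F_{0}\) plus a voicing decision. The use of PyTorch, the feature dimensionality, the layer sizes, and the two-dimensional output are assumptions made for the example.

```python
# Hypothetical sketch (not the paper's code): feedforward vs. recurrent
# F0 regression from frame-level contextual features. Dimensions and
# framework (PyTorch) are assumptions for illustration only.
import torch
import torch.nn as nn

FEAT_DIM = 300  # assumed size of the contextual feature vector per frame
                # (phone identity, positional features, and Arabic-specific
                # flags such as vowel quantity and gemination)

class FeedForwardF0(nn.Module):
    """Frame-wise feedforward regressor: features -> (log F0, voicing)."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=512, layers=4):
        super().__init__()
        blocks, dim = [], feat_dim
        for _ in range(layers):
            blocks += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, 2)   # [log F0, voiced/unvoiced score]

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        return self.head(self.body(x))

class BLSTMF0(nn.Module):
    """Sequence model: bidirectional LSTM over the frame sequence."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256, layers=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, x):
        out, _ = self.rnn(x)               # out: (batch, frames, 2*hidden)
        return self.head(out)

if __name__ == "__main__":
    x = torch.randn(8, 200, FEAT_DIM)      # 8 utterances, 200 frames each
    target = torch.randn(8, 200, 2)        # reference log F0 + voicing
    model = BLSTMF0()
    loss = nn.MSELoss()(model(x), target)  # a typical regression objective
    loss.backward()
    print(loss.item())
```

In practice, such models are trained on frame-aligned linguistic/acoustic pairs; the feedforward variant treats frames independently, while the bidirectional LSTM exploits the surrounding context of the whole utterance.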



Acknowledgement

This research work was conducted within the framework of the PHC-Utique Program, financed by CMCU (Comité mixte de coopération universitaire), grant No. 15G1405.

Author information

Corresponding author

Correspondence to Imene Zangar.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Zangar, I., Mnasri, Z., Colotte, V., Jouvet, D. (2020). \(F_{0}\) Modeling Using DNN for Arabic Parametric Speech Synthesis. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_20
