
GAN acoustic model for Kazakh speech synthesis

Published in: International Journal of Speech Technology

Abstract

Recent studies on the application of generative adversarial networks (GANs) to speech synthesis have shown improvements in the naturalness of synthesized speech compared to conventional approaches. In this article, we present a new GAN framework for training an acoustic model for speech synthesis. The proposed GAN consists of a generator and a pair of agent discriminators: the generator produces acoustic parameters conditioned on linguistic parameters, while the pair of agent discriminators is introduced to improve the naturalness of the synthesized speech. We feed the agents with both acoustic and linguistic parameters, so that they examine not only the acoustic distribution but also the relationship between linguistic and acoustic parameters. Training and testing were conducted on a Kazakh speech corpus. The results show that the proposed GAN framework improves the accuracy of the acoustic model for the Kazakh text-to-speech system.
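To make the described architecture concrete, the sketch below shows one plausible realization: a generator mapping linguistic parameters to acoustic parameters, and two agent discriminators, one scoring acoustic parameters alone and one scoring the joint linguistic-acoustic pair. This is a minimal illustration, not the authors' implementation; the feature dimensions (LING_DIM, ACOU_DIM), layer types and sizes, and the loss weighting are assumptions made for the example.

```python
# A minimal PyTorch sketch, assuming simple feed-forward networks and
# illustrative feature dimensions; the authors' exact configuration is
# not specified in the abstract.
import torch
import torch.nn as nn

LING_DIM = 300   # assumed dimensionality of linguistic (context) features
ACOU_DIM = 187   # assumed dimensionality of acoustic features (spectral + F0 + aperiodicity)


def mlp(in_dim: int, out_dim: int, hidden: int = 512) -> nn.Sequential:
    """Small feed-forward block shared by all three networks."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class Generator(nn.Module):
    """Maps linguistic parameters to acoustic parameters."""
    def __init__(self):
        super().__init__()
        self.net = mlp(LING_DIM, ACOU_DIM)

    def forward(self, ling):
        return self.net(ling)


class AcousticAgent(nn.Module):
    """Agent discriminator over acoustic parameters only (distribution check)."""
    def __init__(self):
        super().__init__()
        self.net = mlp(ACOU_DIM, 1)

    def forward(self, acou):
        return torch.sigmoid(self.net(acou))


class JointAgent(nn.Module):
    """Agent discriminator over concatenated linguistic + acoustic parameters
    (checks the linguistic-acoustic relationship)."""
    def __init__(self):
        super().__init__()
        self.net = mlp(LING_DIM + ACOU_DIM, 1)

    def forward(self, ling, acou):
        return torch.sigmoid(self.net(torch.cat([ling, acou], dim=-1)))


def training_step(gen, d_acou, d_joint, ling, acou_real):
    """One adversarial step: both agents score natural vs. generated frames,
    and the generator is additionally pulled toward the targets by an MSE term."""
    bce = nn.BCELoss()
    acou_fake = gen(ling)
    ones = torch.ones(ling.size(0), 1, device=ling.device)
    zeros = torch.zeros(ling.size(0), 1, device=ling.device)

    # Agent (discriminator) losses: natural -> 1, generated -> 0
    loss_d = (bce(d_acou(acou_real), ones) + bce(d_acou(acou_fake.detach()), zeros)
              + bce(d_joint(ling, acou_real), ones) + bce(d_joint(ling, acou_fake.detach()), zeros))

    # Generator loss: fool both agents while matching the natural parameters
    loss_g = (bce(d_acou(acou_fake), ones) + bce(d_joint(ling, acou_fake), ones)
              + nn.functional.mse_loss(acou_fake, acou_real))
    return loss_g, loss_d
```

Feeding the joint agent both feature streams is what lets the adversarial signal penalize acoustic trajectories that look plausible in isolation but are inconsistent with the input text, which is how we read the stated motivation for the second agent.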



Acknowledgements

This study was financially supported by the Russian Science Foundation (Project No. 18-18-00063) and the Russian Foundation for Basic Research (Project No. 19-57-45008 IND_a).

Author information

Corresponding author

Correspondence to Yuri N. Matveev.



Cite this article

Kaliyev, A., Zeno, B., Rybin, S.V. et al. GAN acoustic model for Kazakh speech synthesis. Int J Speech Technol 24, 729–735 (2021). https://doi.org/10.1007/s10772-021-09840-0

