
Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin


Abstract

Currently, most speech synthesis systems generate speech only in a reading style, which greatly limits the expressiveness of the synthesized speech. To improve expressiveness, this paper focuses on the generation of exclamatory and interrogative speech for spoken Mandarin. We propose a multi-style (exclamatory and interrogative) deep neural network-based acoustic model consisting of a style-specific layer (which may itself comprise multiple layers) on top of several shared hidden layers. The style-specific layer models the patterns distinct to each style, while the shared layers allow maximum knowledge sharing between declarative and multi-style speech. We investigate five major aspects of multi-style adaptation: the neural network type and topology, the number of layers in the style-specific part, the initial model, the adaptation parameters, and the adaptation corpus size. Both objective and subjective evaluations are carried out to assess the proposed method. Experimental results show that the proposed multi-style BLSTM with only the top layer adapted outperforms our prior work (trained with a combination of constrained maximum likelihood linear regression and structural maximum a posteriori adaptation) and achieves the best performance. We also find that adapting both the spectral and excitation parameters is more effective than adapting the excitation parameters alone.
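
A minimal sketch may help make the proposed architecture concrete: several hidden layers shared across declarative and multi-style speech, topped by a style-specific layer that is the only part re-estimated during adaptation. The snippet below is an illustrative PyTorch sketch, not the authors' implementation; the layer sizes, input/output feature dimensions, and helper names are assumptions made for the example.

```python
# Illustrative sketch (PyTorch), not the authors' implementation: a BLSTM
# acoustic model with shared layers plus a style-specific top layer, where
# only the style-specific part is updated during adaptation. All layer
# sizes and feature dimensions below are assumptions for the example.

import torch
import torch.nn as nn


class MultiStyleBLSTM(nn.Module):
    def __init__(self, in_dim=355, hidden=256, shared_layers=3, out_dim=187):
        super().__init__()
        # Shared BLSTM stack: knowledge common to declarative and
        # multi-style (exclamatory/interrogative) speech.
        self.shared = nn.LSTM(in_dim, hidden, num_layers=shared_layers,
                              bidirectional=True, batch_first=True)
        # Style-specific block: here a single BLSTM layer plus the output
        # projection; it could also comprise several layers.
        self.style_layer = nn.LSTM(2 * hidden, hidden,
                                   bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):
        h, _ = self.shared(x)        # shared representation
        h, _ = self.style_layer(h)   # style-specific modelling
        return self.output(h)        # spectral + excitation parameters


def freeze_shared_layers(model):
    """Return only the style-specific parameters, freezing the shared stack,
    so adaptation on a small exclamatory/interrogative corpus updates only
    the top of the network."""
    for p in model.shared.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]


# First train the whole model on declarative speech, then adapt:
model = MultiStyleBLSTM()
optimizer = torch.optim.Adam(freeze_shared_layers(model), lr=1e-4)
```

In this sketch, updating the full output projection adapts both the spectral and excitation streams at once; restricting the update to the excitation dimensions would correspond to the weaker configuration mentioned in the abstract.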





Acknowledgements

This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (No. 61305003, No. 61425017, No. 61403386), and the Strategic Priority Research Program of the CAS (Grant XDB02080006), and is partly supported by the Major Program for the National Social Science Fund of China (13&ZD189).

Author information


Corresponding author

Correspondence to Ya Li.


About this article


Cite this article

Zheng, Y., Li, Y., Wen, Z. et al. Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin. J Sign Process Syst 90, 1039–1052 (2018). https://doi.org/10.1007/s11265-017-1290-2

