Abstract
Most current speech synthesis systems generate speech only in a reading style, which greatly limits the expressiveness of the synthesized speech. To improve expressiveness, this paper focuses on generating exclamatory and interrogative speech for spoken Mandarin. We propose a multi-style (exclamatory and interrogative) deep neural network-based acoustic model consisting of a style-specific layer (which may itself comprise one or more layers) and several shared hidden layers. The style-specific layer models the distinct patterns of each style, while the shared layers allow maximum knowledge sharing between declarative and multi-style speech. We investigate five major aspects of multi-style adaptation: neural network type and topology, the number of style-specific layers, the initial model, the adaptation parameters, and the adaptation corpus size. Both objective and subjective evaluations are carried out to assess the proposed method. Experimental results show that the proposed multi-style BLSTM with only the top layer adapted outperforms our prior work (trained with a combination of constrained maximum likelihood linear regression and structural maximum a posteriori adaptation) and achieves the best performance. We also find that adapting both the spectral and excitation parameters is more effective than adapting the excitation parameters alone.
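To make the architecture concrete, the following is a minimal sketch of a shared-BLSTM acoustic model with one style-specific top layer per style, where only that layer is updated during style adaptation. It is not the authors' implementation (their experiments used Theano); the use of PyTorch, the layer sizes, and the feature dimensions are illustrative assumptions.

```python
# Sketch of a multi-style BLSTM acoustic model: shared recurrent layers plus
# one style-specific output layer per style. Dimensions are placeholders.
import torch
import torch.nn as nn


class MultiStyleBLSTM(nn.Module):
    """Shared BLSTM layers with a style-specific linear output layer per style."""

    def __init__(self, in_dim=355, hidden=256, out_dim=187,
                 styles=("declarative", "exclamatory", "interrogative")):
        super().__init__()
        # Shared hidden layers: knowledge shared across all speaking styles.
        self.shared = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # One style-specific top layer per style; only this layer is
        # re-estimated when adapting to a new style with limited data.
        self.style_layers = nn.ModuleDict(
            {s: nn.Linear(2 * hidden, out_dim) for s in styles})

    def forward(self, linguistic_feats, style):
        h, _ = self.shared(linguistic_feats)   # (batch, frames, 2 * hidden)
        return self.style_layers[style](h)     # acoustic features per frame

    def adaptation_parameters(self, style):
        """Parameters updated during style adaptation (top layer only)."""
        return self.style_layers[style].parameters()


# Example: adapt only the exclamatory top layer, keeping shared layers fixed.
model = MultiStyleBLSTM()
optim = torch.optim.Adam(model.adaptation_parameters("exclamatory"), lr=1e-4)
x = torch.randn(4, 100, 355)                   # dummy linguistic features
y_hat = model(x, style="exclamatory")          # (4, 100, 187)
```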
Acknowledgements
This work was supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (Nos. 61305003, 61425017, and 61403386), and the Strategic Priority Research Program of the CAS (Grant No. XDB02080006), and was partly supported by the Major Program for the National Social Science Fund of China (13&ZD189).
Cite this article
Zheng, Y., Li, Y., Wen, Z. et al. Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin. J Sign Process Syst 90, 1039–1052 (2018). https://doi.org/10.1007/s11265-017-1290-2