Abstract
Most current speech synthesis systems generate speech only in a reading style, which greatly limits the expressiveness of the synthesized speech. To improve expressiveness, this paper focuses on generating exclamatory and interrogative speech for spoken Mandarin. We propose a multi-style (exclamatory and interrogative) deep neural network-based acoustic model consisting of a style-specific layer (which may itself comprise one or more layers) and several shared hidden layers. The style-specific layer models the distinct patterns of each style, while the shared layers allow maximum knowledge sharing between declarative and multi-style speech. We investigate five major aspects of multi-style adaptation: neural network type and topology, the number of style-specific layers, the initial model, the adaptation parameters, and the adaptation corpus size. Both objective and subjective evaluations are carried out to assess the proposed method. Experimental results show that the proposed multi-style BLSTM with only the top layer adapted outperforms our prior work (trained with a combination of constrained maximum likelihood linear regression and structural maximum a posteriori adaptation) and achieves the best performance. We also find that adapting both the spectral and excitation parameters is more effective than adapting the excitation parameters alone.
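To make the architecture concrete, the following is a minimal sketch of a shared-BLSTM acoustic model with one style-specific top layer per style, where only that layer is updated during style adaptation. It is not the authors' implementation (their experiments used Theano); the use of PyTorch, the layer sizes, and the feature dimensions are illustrative assumptions.

```python
# Sketch of a multi-style BLSTM acoustic model: shared recurrent layers plus
# one style-specific output layer per style. Dimensions are placeholders.
import torch
import torch.nn as nn


class MultiStyleBLSTM(nn.Module):
    """Shared BLSTM layers with a style-specific linear output layer per style."""

    def __init__(self, in_dim=355, hidden=256, out_dim=187,
                 styles=("declarative", "exclamatory", "interrogative")):
        super().__init__()
        # Shared hidden layers: knowledge shared across all speaking styles.
        self.shared = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # One style-specific top layer per style; only this layer is
        # re-estimated when adapting to a new style with limited data.
        self.style_layers = nn.ModuleDict(
            {s: nn.Linear(2 * hidden, out_dim) for s in styles})

    def forward(self, linguistic_feats, style):
        h, _ = self.shared(linguistic_feats)   # (batch, frames, 2 * hidden)
        return self.style_layers[style](h)     # acoustic features per frame

    def adaptation_parameters(self, style):
        """Parameters updated during style adaptation (top layer only)."""
        return self.style_layers[style].parameters()


# Example: adapt only the exclamatory top layer, keeping shared layers fixed.
model = MultiStyleBLSTM()
optim = torch.optim.Adam(model.adaptation_parameters("exclamatory"), lr=1e-4)
x = torch.randn(4, 100, 355)                   # dummy linguistic features
y_hat = model(x, style="exclamatory")          # (4, 100, 187)
```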
Acknowledgements
This work was supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (Nos. 61305003, 61425017, and 61403386), and the Strategic Priority Research Program of the CAS (Grant No. XDB02080006), and was partly supported by the Major Program for the National Social Science Fund of China (13&ZD189).
Cite this article
Zheng, Y., Li, Y., Wen, Z. et al. Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin. J Sign Process Syst 90, 1039–1052 (2018). https://doi.org/10.1007/s11265-017-1290-2