Mandarin Prosody Prediction Based on Attention Mechanism and Multi-model Ensemble

Xie, Kun; Pan, Wei

doi:10.1007/978-3-319-95930-6_45


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10954)

Included in the following conference series:

International Conference on Intelligent Computing


Abstract

Prosodic boundary prediction is an important and challenging part of the speech synthesis task, because the result of prosodic prediction directly determines the quality of the synthesized speech. In this paper, we propose a prosodic boundary prediction method based on an “encoding-decoding” (encoder-decoder) framework, with an effective position attention mechanism to further improve performance. We also investigate the use of Random Forest and Gradient Boosting Decision Tree models to explore the potential of combining multiple models. The experimental results show that, compared with the current best method for prosodic structure prediction (Bi-LSTM), the proposed method achieves good F1 scores for prosodic words, prosodic phrases, and intonation phrases; a subjective experiment also shows that the proposed method can improve the quality and naturalness of synthesized speech.
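
The abstract reports F1 scores separately for prosodic words, prosodic phrases, and intonation phrases. As a minimal sketch of how such a per-level score could be computed, the snippet below treats each boundary level as the positive class in turn. The label names (`PW`, `PPH`, `IPH`, `N`) and the flat, one-label-per-token scheme are assumptions made for illustration; the paper's own labeling and evaluation protocol may differ (for example, by treating higher-level boundaries as also counting for lower levels).

```python
from typing import Dict, Sequence

# Hypothetical label scheme (an assumption, not taken from the paper):
# "PW" = prosodic word, "PPH" = prosodic phrase, "IPH" = intonation phrase,
# "N" = no boundary after the token.
LEVELS = ("PW", "PPH", "IPH")


def boundary_f1(gold: Sequence[str], pred: Sequence[str], level: str) -> float:
    """F1 for one boundary level, treating that level as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == level and p == level)
    fp = sum(1 for g, p in zip(gold, pred) if g != level and p == level)
    fn = sum(1 for g, p in zip(gold, pred) if g == level and p != level)
    if tp == 0:
        # No correct positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_level(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute the F1 score for each boundary level independently."""
    return {level: boundary_f1(gold, pred, level) for level in LEVELS}
```

Calling `f1_by_level(gold, pred)` on two equal-length label sequences returns one score per level, which is the shape of comparison the abstract describes against the Bi-LSTM baseline.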



Author information

Authors and Affiliations

Fujian Key Laboratory of Brain-Inspired Computing Technique and Applications, School of Information Science and Engineering, Xiamen University, Xiamen, China
Kun Xie & Wei Pan


Corresponding authors

Correspondence to Kun Xie or Wei Pan.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
Polytechnic of Bari, Bari, Italy
Vitoantonio Bevilacqua
University of Wollongong, North Wollongong, New South Wales, Australia
Prashan Premaratne
Indian Institute of Technology Kanpur, Kanpur, India
Phalguni Gupta


About this paper

Cite this paper

Xie, K., Pan, W. (2018). Mandarin Prosody Prediction Based on Attention Mechanism and Multi-model Ensemble. In: Huang, DS., Bevilacqua, V., Premaratne, P., Gupta, P. (eds) Intelligent Computing Theories and Application. ICIC 2018. Lecture Notes in Computer Science, vol 10954. Springer, Cham. https://doi.org/10.1007/978-3-319-95930-6_45


DOI: https://doi.org/10.1007/978-3-319-95930-6_45
Published: 06 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95929-0
Online ISBN: 978-3-319-95930-6
eBook Packages: Computer Science, Computer Science (R0)
