Mandarin Prosody Prediction Based on Attention Mechanism and Multi-model Ensemble

  • Conference paper
Intelligent Computing Theories and Application (ICIC 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10954)

Abstract

Prosodic boundary prediction is an important and challenging part of the speech synthesis task, because its result directly determines the quality of the synthesized speech. In this paper, we propose a prosodic boundary prediction method based on an encoder-decoder framework, with a position attention mechanism to further improve performance. We then investigate Random Forest and Gradient Boosting Decision Tree models to explore the potential of combining multiple models. Experimental results show that, compared with the current best method for prosodic structure prediction (Bi-LSTM), the proposed method achieves good F1 scores for prosodic words, prosodic phrases, and intonation phrases. A subjective experiment also shows that the proposed method can improve the quality and naturalness of synthesized speech.
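The abstract reports F1 scores separately for prosodic words, prosodic phrases, and intonation phrases. As a rough illustration of what such a per-level score could look like, the sketch below computes F1 for each boundary level from a gold and a predicted label sequence. The label names ("NB", "PW", "PP", "IP"), the one-label-per-position encoding, and the one-vs-rest scoring are all assumptions made for this example; the paper's actual evaluation protocol is not described in the abstract and may differ.

```python
from typing import Dict, Sequence

# Hypothetical label set (not taken from the paper):
# "NB" = no boundary, "PW" = prosodic word,
# "PP" = prosodic phrase, "IP" = intonation phrase.
LEVELS = ("PW", "PP", "IP")


def f1_per_level(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Return the F1 score of each boundary level, scored one-vs-rest."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for level in LEVELS:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == level and p == level)
        fp = sum(1 for g, p in pairs if g != level and p == level)
        fn = sum(1 for g, p in pairs if g == level and p != level)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the level never occurs.
        denom = 2 * tp + fp + fn
        scores[level] = 2 * tp / denom if denom else 0.0
    return scores


# Toy sequences, invented for illustration only.
gold = ["NB", "PW", "NB", "PP", "PW", "NB", "IP"]
pred = ["NB", "PW", "PW", "PP", "NB", "NB", "IP"]
scores = f1_per_level(gold, pred)
print(scores)
```

In this toy case one of the two gold prosodic-word boundaries is found and one spurious one is predicted, so the prosodic-word F1 is 0.5, while the other two levels are predicted exactly.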

Author information

Corresponding authors

Correspondence to Kun Xie or Wei Pan.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Xie, K., Pan, W. (2018). Mandarin Prosody Prediction Based on Attention Mechanism and Multi-model Ensemble. In: Huang, DS., Bevilacqua, V., Premaratne, P., Gupta, P. (eds) Intelligent Computing Theories and Application. ICIC 2018. Lecture Notes in Computer Science, vol 10954. Springer, Cham. https://doi.org/10.1007/978-3-319-95930-6_45

  • DOI: https://doi.org/10.1007/978-3-319-95930-6_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-95929-0

  • Online ISBN: 978-3-319-95930-6

  • eBook Packages: Computer Science, Computer Science (R0)
