Generating emphatic speech with hidden Markov model for expressive speech synthesis

Multimedia Tools and Applications

Abstract

Emphasis plays an important role in expressive speech synthesis: it highlights the focus of an utterance to draw the listener's attention. Because a sentence contains only a few emphasized words, data scarcity is one of the central problems in emphatic speech synthesis. In this paper, we analyze contrastive (neutral versus emphatic) speech recordings with respect to different kinds of contexts, i.e. the relative locations of the syllables and the emphasized words. Based on the analysis, we propose a hidden Markov model (HMM) based method for emphatic speech synthesis with a limited amount of data. In this method, decision trees (DTs) are constructed with non-emphasis-related questions using both neutral and emphasis corpora. The data in each leaf node of the DTs are classified into 6 emphasis categories according to the emphasis-related questions. The data in the same emphasis category are grouped into one sub-node and used to train one HMM. Because a leaf node may contain no data for some emphasis categories, a cost-based method is proposed to select a suitable HMM from the same leaf node for predicting parameters. Furthermore, a compensation model is proposed to adjust the predicted parameters. We conduct a series of experiments to evaluate the performance of the approach. The experiments indicate that the proposed emphatic speech synthesis models improve the emphasis quality of synthesized speech while maintaining a high degree of naturalness.
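
The fallback step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the cost function or the form of the compensation model, so `category_cost` (index distance) and `compensate` (an additive offset) are placeholder assumptions, and the names `LeafNode`, `select_model` and `offsets` are hypothetical. Only the overall shape follows the abstract: one model per emphasis category inside a leaf node, cost-based selection when the wanted category has no model, then an adjustment of the predicted parameters.

```python
from __future__ import annotations

from dataclasses import dataclass, field

NUM_CATEGORIES = 6  # the abstract states 6 emphasis categories


@dataclass
class LeafNode:
    """One decision-tree leaf: emphasis category -> model parameters.

    A plain list of floats stands in for a trained HMM (assumption).
    """
    models: dict[int, list[float]] = field(default_factory=dict)


def category_cost(wanted: int, available: int) -> float:
    """Placeholder cost: distance between category indices (assumption)."""
    return float(abs(wanted - available))


def select_model(leaf: LeafNode, wanted: int) -> tuple[int, list[float]]:
    """Return (category, parameters) of the lowest-cost model in the leaf.

    If the wanted category was trained, its cost is 0 and it is chosen.
    Ties are broken in favor of the lower category index.
    """
    if not leaf.models:
        raise ValueError("leaf node has no trained models")
    best = min(leaf.models, key=lambda cat: (category_cost(wanted, cat), cat))
    return best, leaf.models[best]


def compensate(
    params: list[float],
    wanted: int,
    used: int,
    offsets: dict[tuple[int, int], list[float]],
) -> list[float]:
    """Placeholder compensation: add an offset keyed by (used, wanted)."""
    if wanted == used:
        return list(params)
    delta = offsets.get((used, wanted), [0.0] * len(params))
    return [p + d for p, d in zip(params, delta)]


# Usage: the leaf has models for categories 0 and 3 only; category 2 is wanted.
leaf = LeafNode({0: [1.0, 2.0], 3: [4.0, 5.0]})
used, params = select_model(leaf, wanted=2)
adjusted = compensate(params, 2, used, {(3, 2): [-0.5, 0.25]})
```

In this sketch the substitute is always taken from the same leaf node, as the abstract describes; how the cost and the compensation are actually estimated is left open here.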

Acknowledgments

This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). This work is also partially supported by the Hong Kong SAR Government’s Research Grants Council (N-CUHK414/09), the National Natural Science Foundation of China (61375027, 61370023 and 60805008), the National Social Science Foundation (Major Project) (13&ZD189), and Guangdong Provincial Science and Technology Program (2012A011100008). The authors would like to thank the students of the research group of Human Computer Speech Interaction in Tsinghua University, the Graduate School at Shenzhen of Tsinghua University and the Chinese University of Hong Kong, for their cooperation with the dataset setup and experiments.

Author information

Corresponding author

Correspondence to Xiao Zang.

About this article

Cite this article

Wu, Z., Ning, Y., Zang, X. et al. Generating emphatic speech with hidden Markov model for expressive speech synthesis. Multimed Tools Appl 74, 9909–9925 (2015). https://doi.org/10.1007/s11042-014-2164-2

  • DOI: https://doi.org/10.1007/s11042-014-2164-2
