Generating emphatic speech with hidden Markov model for expressive speech synthesis

Wu, Zhiyong; Ning, Yishuang; Zang, Xiao; Jia, Jia; Meng, Fanbo; Meng, Helen; Cai, Lianhong

doi:10.1007/s11042-014-2164-2


Published: 12 July 2014

Volume 74, pages 9909–9925, (2015)

Multimedia Tools and Applications

Zhiyong Wu^1,2,3,
Yishuang Ning^1,3,
Xiao Zang^1,3,
Jia Jia^1,3,
Fanbo Meng^1,3,
Helen Meng^1,2 &
…
Lianhong Cai^1,3


Abstract

Emphasis plays an important role in expressive speech synthesis: it highlights the focus of an utterance to draw the listener's attention. Because only a few words in a sentence are emphasized, data scarcity is one of the most important problems for emphatic speech synthesis. In this paper, we analyze contrastive (neutral versus emphatic) speech recordings with respect to different kinds of contexts, i.e. the relative locations between the syllables and the emphasized words. Based on the analysis, we propose a hidden Markov model (HMM) based method for emphatic speech synthesis with a limited amount of data. In this method, decision trees (DTs) are constructed with non-emphasis-related questions using both neutral and emphasis corpora. The data in each leaf node of the DTs are classified into 6 emphasis categories according to the emphasis-related questions. The data in the same emphasis category are grouped into one sub-node and are used to train one HMM. As a leaf node of the DTs may contain no data for some emphasis categories, a method based on cost calculation is proposed to select a suitable HMM in the same leaf node for predicting parameters. Further, a compensation model is proposed to adjust the predicted parameters. We conduct a series of experiments to evaluate the performance of the approach. The experiments indicate that the proposed emphatic speech synthesis models improve the emphasis quality of synthesized speech while keeping a high degree of naturalness.
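
The sub-node grouping and the fallback selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the cost function, so `hypothetical_cost` (the distance between category indices) is purely an assumption, as are the function names and the use of a scalar mean as a stand-in for a trained HMM. The compensation model is omitted because the abstract gives no detail about it.

```python
from collections import defaultdict
from statistics import fmean

NUM_CATEGORIES = 6  # the abstract states that leaf data fall into 6 emphasis categories


def train_sub_nodes(leaf_samples):
    """Group the (category, value) samples of one leaf node by emphasis category.

    Returns one stand-in "model" per category that has data. Here the model is
    just the mean of a scalar feature; the paper trains one HMM per sub-node.
    """
    grouped = defaultdict(list)
    for category, value in leaf_samples:
        if not 0 <= category < NUM_CATEGORIES:
            raise ValueError(f"unknown emphasis category: {category}")
        grouped[category].append(value)
    return {category: fmean(values) for category, values in grouped.items()}


def hypothetical_cost(target, candidate):
    """ASSUMPTION: cost is the distance between category indices.

    The abstract only says that selection is "based on cost calculation";
    the real cost function is not given there.
    """
    return abs(target - candidate)


def select_model(sub_nodes, target):
    """Return (category, model) for the target category within one leaf node.

    If the leaf has no data for the target category, fall back to the
    lowest-cost category that does have a model (ties go to the lower index).
    """
    if not sub_nodes:
        raise ValueError("leaf node has no trained sub-node models")
    if target in sub_nodes:
        return target, sub_nodes[target]
    best = min(sub_nodes, key=lambda c: (hypothetical_cost(target, c), c))
    return best, sub_nodes[best]


if __name__ == "__main__":
    # One leaf node with data for categories 0 and 2 only.
    leaf = [(0, 1.0), (0, 3.0), (2, 10.0)]
    models = train_sub_nodes(leaf)
    print(select_model(models, 2))  # category present in the leaf
    print(select_model(models, 3))  # category missing, falls back to a neighbour
```

The fallback stays inside the same leaf node, as the abstract describes, so the non-emphasis context of the selected model still matches the target; only the emphasis category differs.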



Acknowledgments

This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). This work is also partially supported by the Hong Kong SAR Government’s Research Grants Council (N-CUHK414/09), the National Natural Science Foundation of China (61375027, 61370023 and 60805008), the National Social Science Foundation (Major Project) (13&ZD189), and Guangdong Provincial Science and Technology Program (2012A011100008). The authors would like to thank the students of the research group of Human Computer Speech Interaction in Tsinghua University, the Graduate School at Shenzhen of Tsinghua University and the Chinese University of Hong Kong, for their cooperation with the dataset setup and experiments.

Author information

Authors and Affiliations

Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, and Shenzhen Key Laboratory of Information Science and Technology, Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, China
Zhiyong Wu, Yishuang Ning, Xiao Zang, Jia Jia, Fanbo Meng, Helen Meng & Lianhong Cai
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, SAR, China
Zhiyong Wu & Helen Meng
Tsinghua National Laboratory for Information Science and Technology (TNList), and Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Zhiyong Wu, Yishuang Ning, Xiao Zang, Jia Jia, Fanbo Meng & Lianhong Cai


Corresponding author

Correspondence to Xiao Zang.


About this article

Cite this article

Wu, Z., Ning, Y., Zang, X. et al. Generating emphatic speech with hidden Markov model for expressive speech synthesis. Multimed Tools Appl 74, 9909–9925 (2015). https://doi.org/10.1007/s11042-014-2164-2


Received: 05 March 2014
Revised: 01 June 2014
Accepted: 23 June 2014
Published: 12 July 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s11042-014-2164-2
