An HMM-Based Mandarin Chinese Text-To-Speech System

Qian, Yao; Soong, Frank; Chen, Yining; Chu, Min

doi:10.1007/11939993_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4274)

Included in the following conference series:

International Symposium on Chinese Spoken Language Processing

Abstract

In this paper we present our Hidden Markov Model (HMM)-based Mandarin Chinese Text-to-Speech (TTS) system. Mandarin Chinese, or Putonghua, “the common spoken language”, is a tone language in which each of the 400-plus base syllables can have up to 5 different lexical tone patterns. The segmental and supra-segmental information of these syllables is first modeled by 3 corresponding HMMs: (1) spectral envelope and gain; (2) voiced/unvoiced decision and fundamental frequency; and (3) segment duration. The HMMs are trained on a read-speech database of 1,000 sentences recorded by a female speaker. Specifically, the spectral information is derived from short-time LPC spectral analysis. Among all LPC parameters, the Line Spectrum Pair (LSP) has the closest relevance to the natural resonances, or “formants”, of a speech sound, and it is selected to parameterize the spectral information. Furthermore, the tendency of LSPs to cluster around a spectral peak justifies augmenting LSPs with their dynamic counterparts, both in time and in frequency, in both HMM modeling and parameter trajectory synthesis. One hundred sentences synthesized by 4 LSP-based systems have been subjectively evaluated with an AB comparison test. The listening test results show that LSPs together with their dynamic counterparts, both in time and in frequency, are preferred because they yield higher synthesized speech quality.
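As a rough illustration of what augmenting LSPs with dynamic counterparts "in time and frequency" could look like, the sketch below builds an augmented feature vector per frame. It is not the authors' implementation: the central-difference window for the time delta, the use of adjacent-order LSP differences as the frequency delta, and the function names `time_deltas`, `frequency_deltas`, and `augment` are all assumptions made for this example.

```python
def time_deltas(frames):
    """Time delta per LSP order: 0.5 * (next frame - previous frame).

    Assumption: a simple central difference; the first and last frames
    are repeated at the edges so every frame gets a delta.
    """
    n = len(frames)
    deltas = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        deltas.append([0.5 * (b - a) for a, b in zip(prev, nxt)])
    return deltas


def frequency_deltas(frames):
    """Frequency delta per frame: difference between adjacent-order LSPs.

    Assumption: a small difference indicates two LSPs clustered around
    a spectral peak.
    """
    return [[hi - lo for lo, hi in zip(f, f[1:])] for f in frames]


def augment(frames):
    """Concatenate static LSPs, time deltas, and frequency deltas."""
    dt = time_deltas(frames)
    df = frequency_deltas(frames)
    return [s + a + b for s, a, b in zip(frames, dt, df)]


# Toy input: 3 frames of order-3 LSPs (ascending values in radians).
frames = [
    [0.10, 0.30, 0.60],
    [0.20, 0.35, 0.70],
    [0.30, 0.40, 0.80],
]
features = augment(frames)
```

For LSPs of order p, each augmented frame here has p static values, p time deltas, and p - 1 frequency deltas. A real system would typically also add second-order time deltas and would model these streams with HMMs rather than use them directly.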

Author information

Authors and Affiliations

Microsoft Research Asia, Beijing
Yao Qian, Frank Soong, Yining Chen & Min Chu

Editor information

Editors and Affiliations

Department of Computer Science, The University of Hong Kong, Hong Kong
Qiang Huo
Human Language Technology Department, Institute for Infocomm Research (I2R), 119613, Singapore
Bin Ma
School of Computer Engineering, Nanyang Technological University (NTU), 639798, Singapore
Eng-Siong Chng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Haizhou Li

About this paper

Cite this paper

Qian, Y., Soong, F., Chen, Y., Chu, M. (2006). An HMM-Based Mandarin Chinese Text-To-Speech System. In: Huo, Q., Ma, B., Chng, ES., Li, H. (eds) Chinese Spoken Language Processing. ISCSLP 2006. Lecture Notes in Computer Science, vol 4274. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11939993_26

DOI: https://doi.org/10.1007/11939993_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49665-6
Online ISBN: 978-3-540-49666-3
eBook Packages: Computer Science (R0)
