Analysis and modeling of F0 contours for Cantonese text-to-speech

Published: 01 September 2004

Abstract

For the generation of highly natural synthetic speech, the control of prosody is of primary importance. The fundamental frequency (F0) is one of the most important components of speech prosody. This research investigates the variation of F0 in continuous Cantonese speech, with the goal of establishing an effective mechanism of prosody control in Cantonese text-to-speech (TTS) applications. Cantonese is a commonly used Chinese dialect that is well known for being rich in tones. This article describes a simple yet effective approach to the analysis and modeling of F0. The surface F0 contour of a continuous Cantonese utterance is considered to be the combination of a global component (the phrase-level intonation curve) and local components (the syllable-level tone contours). A novel method of F0 normalization is proposed to separate the local components from the global one. As a result, the variation in tone contours is greatly reduced. Statistical analysis is performed on the phrase curves and context-dependent tone contours that are extracted from a large corpus of 1,200 utterances. Specifically, the analysis focuses on co-articulated tone contours for disyllabic words, cross-word contours, and phrase-initial tone contours. Based on the results of the analysis, a template-based model for F0 generation is established and integrated with a Cantonese TTS system. Subjective listening tests show that the proposed model significantly improves the naturalness of the output speech.
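The combination of a global phrase curve with local tone contours can be illustrated with a minimal sketch. It is not the authors' model: it assumes a phrase curve that declines linearly in log-frequency, tone templates stored as semitone offsets from that curve, and four sample points per syllable. The names `TONE_TEMPLATES`, `phrase_curve`, and `generate_f0`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
import math

POINTS_PER_SYLLABLE = 4  # assumed sampling density, not taken from the article

# Hypothetical tone templates: offsets in semitones relative to the phrase curve.
# The values are illustrative only and are not measurements from the corpus.
TONE_TEMPLATES = {
    1: [3.0, 3.0, 3.0, 3.0],      # high level
    2: [-2.0, -1.0, 1.0, 3.0],    # high rising
    3: [0.0, 0.0, 0.0, 0.0],      # mid level
    6: [-2.0, -2.0, -2.0, -2.0],  # low level
}


def phrase_curve(n_points, start_hz, end_hz):
    """Global component: a straight line in log-frequency from start_hz to end_hz."""
    if n_points == 1:
        return [start_hz]
    log_start, log_end = math.log(start_hz), math.log(end_hz)
    step = (log_end - log_start) / (n_points - 1)
    return [math.exp(log_start + step * k) for k in range(n_points)]


def generate_f0(tones, start_hz=220.0, end_hz=180.0):
    """Superpose each syllable's tone template on the phrase curve."""
    curve = phrase_curve(len(tones) * POINTS_PER_SYLLABLE, start_hz, end_hz)
    contour = []
    for i, tone in enumerate(tones):
        for j, offset in enumerate(TONE_TEMPLATES[tone]):
            base = curve[i * POINTS_PER_SYLLABLE + j]
            # A semitone offset corresponds to a frequency ratio of 2 ** (offset / 12).
            contour.append(base * 2.0 ** (offset / 12.0))
    return contour


if __name__ == "__main__":
    for value in generate_f0([1, 3, 6]):
        print(round(value, 1))
```

Working in semitones keeps each template independent of where the syllable falls in the phrase, which is one plausible reading of why normalizing against the phrase curve would reduce variation in the tone contours.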

      Reviews

      Peter C. Patton

      The fundamental frequency (F0) of human speech is the critical factor in creating synthetic speech with natural prosody, the temporal and rhythmic properties of human utterance that make speech sound natural rather than robotic. Mechanical techniques do a fairly good job of synthesizing intelligible speech, by imitating local intonation or tone contours, but don't sound truly natural because they don't handle phrase and sentence fundamental tone contours. Telephone robots have become much more sophisticated, but have not become more natural sounding. Li, Lee, and Qian take on an interesting challenge, by developing a text-to-speech system for Cantonese, a Chinese dialect with many tones. Intonation in Indo-European languages is employed to convey emotion; however, in monosyllabic agglutinative languages like the Chinese dialects, it conveys lexical, and, to some extent, syntactic information. It seems that the authors' technique will work for any language if it works for Cantonese, which has nine tones in all, of which three are entering tones, and the other six occur throughout an utterance. Fukienese, with its 13 tones, might be an interesting stress test as well. The authors capture the change in F0 over an utterance as a phrase curve, and local (syllabic) intonation is detected by any break that exceeds a given length. While the beginning and end frequencies of each of the six tones may vary over the phrase, their ratio or interval will be constant, and the phrase curve will be determined by linear regression over the converted tone heights. The authors analyzed 1,200 utterances, having 4,937 intonation phrases, to develop a Cantonese text-to-speech system, called CUTalk, consisting of three modules: text analysis, acoustic synthesis, and prosody generation. Subjective tests of the system were made with sentences taken from local newspapers, and naturalness was rated, on a scale from one to five, by native speakers. 
The results showed a marked improvement in the generation of natural spoken Chinese by a computer, but also revealed some opportunities for additional improvement. This always seems to be the result of trying to analyze or synthesize human language mechanically; we discover that language is even more complex than we thought, and learn as much about the natural language as we learn about computation for linguistic applications.

Online Computing Reviews Service
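The review's description of the phrase curve, a linear regression over converted tone heights, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: each tone is assumed to sit at a fixed ratio to a common reference level, so dividing a syllable's mean F0 by that ratio yields a "converted" height on the phrase curve. The ratios in `TONE_RATIO` and the function names are hypothetical.

```python
# Hypothetical ratio of each tone's height to a common reference level.
# Illustrative values only; they are not taken from the article.
TONE_RATIO = {1: 1.25, 3: 1.10, 4: 0.85, 6: 1.00}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_phrase_curve(syllables):
    """Fit a straight phrase curve to (time_s, tone, mean_f0_hz) triples."""
    times = [time_s for time_s, _, _ in syllables]
    # Convert every tone height to the shared reference level before regressing.
    converted = [f0 / TONE_RATIO[tone] for _, tone, f0 in syllables]
    return fit_line(times, converted)


if __name__ == "__main__":
    # Synthetic phrase: the reference level falls from 200 Hz at 20 Hz per second.
    tones = [1, 3, 6, 4]
    times = [0.0, 0.2, 0.4, 0.6]
    data = [(t, tone, (200.0 - 20.0 * t) * TONE_RATIO[tone])
            for t, tone in zip(times, tones)]
    print(fit_phrase_curve(data))
```

Because the example data are built from the same ratios used in the conversion, the fit recovers the declining line almost exactly; with real speech the converted heights would scatter around the fitted curve.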

      Published In

      ACM Transactions on Asian Language Information Processing, Volume 3, Issue 3
      September 2004
      44 pages
      ISSN: 1530-0226
      EISSN: 1558-3430
      DOI: 10.1145/1037811

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Chinese dialects
      2. Text-to-speech
      3. fundamental frequency
      4. prosody
      5. tones
