Abstract
In this article we present a prosody generation architecture based on the K-ToBI (Korean Tone and Break Index) representation. ToBI is a multitier representation system, grounded in linguistic knowledge, that transcribes prosodic events in an utterance. TTS (Text-To-Speech) systems that adopt ToBI as an intermediate representation are known to exhibit higher flexibility, modularity, and domain/task portability than direct prosody generation TTS systems. However, reaching practical-level performance makes corpus preparation very expensive: the ToBI-labeled corpus is constructed manually by many prosody experts, and statistical prosody modeling normally requires large amounts of data. Unlike previous ToBI-based systems, this article proposes a new method that transcribes the K-ToBI labels in Korean speech fully automatically. We develop automatic corpus-based K-ToBI labeling tools and prediction methods that use several lexico-syntactic linguistic features for decision-tree induction. We demonstrate the performance of F0 generation from automatically predicted K-ToBI labels, and confirm that it is reasonably comparable to that of state-of-the-art direct prosody generation methods and previous ToBI-based methods.
Index Terms
- Automatic corpus-based tone and break-index prediction using K-ToBI representation