Automatic corpus-based tone and break-index prediction using K-ToBI representation

Published: 01 September 2002

Abstract

In this article we present a prosody generation architecture based on the K-ToBI (Korean Tone and Break Index) representation. ToBI is a multitier representation system, grounded in linguistic knowledge, that transcribes prosodic events in an utterance. TTS (text-to-speech) systems that adopt ToBI as an intermediate representation are known to exhibit higher flexibility, modularity, and domain/task portability than direct prosody generation TTS systems. However, reaching practical-level performance makes corpus preparation very expensive, because a ToBI-labeled corpus is constructed manually by many prosody experts and statistical prosody modeling normally requires large amounts of data. Unlike previous ToBI-based systems, this article proposes a new method that transcribes the K-ToBI labels in Korean speech completely automatically. We develop automatic corpus-based K-ToBI labeling tools and prediction methods that use several lexico-syntactic features for decision-tree induction. We demonstrate F0 generation from automatically predicted K-ToBI labels and confirm that its performance is reasonably comparable to state-of-the-art direct prosody generation methods and to previous ToBI-based methods.
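
To make the idea of inducing a decision tree from lexico-syntactic features concrete, the following is a minimal sketch, not the system described in the article. It assumes an ID3-style learner driven by information gain, and the feature names (`pos`, `next_pos`, `punct`), the training rows, and the integer break-index labels are all invented toy values for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_feature(rows, labels, features):
    """Return the feature whose split yields the highest information gain."""
    base = entropy(labels)

    def gain(feature):
        remainder = 0.0
        for value in set(row[feature] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(features, key=gain)

def build_tree(rows, labels, features):
    """Recursively induce a tree; leaves are labels, inner nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for value in set(row[feature] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[feature] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     remaining)
    return {"feature": feature, "branches": branches, "default": majority}

def predict(tree, row):
    """Walk the tree; unseen feature values fall back to the node's majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical toy data: one row per word boundary, label = break index.
FEATURES = ["pos", "next_pos", "punct"]
ROWS = [
    {"pos": "noun",   "next_pos": "verb", "punct": "none"},
    {"pos": "noun",   "next_pos": "noun", "punct": "none"},
    {"pos": "verb",   "next_pos": "noun", "punct": "comma"},
    {"pos": "verb",   "next_pos": "noun", "punct": "none"},
    {"pos": "adverb", "next_pos": "verb", "punct": "none"},
    {"pos": "noun",   "next_pos": "verb", "punct": "comma"},
    {"pos": "adverb", "next_pos": "noun", "punct": "none"},
    {"pos": "verb",   "next_pos": "verb", "punct": "none"},
]
LABELS = [2, 1, 3, 2, 1, 3, 1, 2]

tree = build_tree(ROWS, LABELS, FEATURES)
```

Calling `predict(tree, {"pos": "noun", "next_pos": "noun", "punct": "comma"})` then returns a break index for a boundary that was not in the toy training set. A real system would train on a labeled speech corpus with a richer feature set and would typically add pruning, which this sketch omits.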

