Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis

Möbius, Bernd

doi:10.1023/A:1021052023237

Published: January 2003

International Journal of Speech Technology, Volume 6, pages 57–71 (2003)

Abstract

One of the most serious challenges for speech synthesis is the systematic treatment of events in language and speech that are known to have low frequencies of occurrence. The problems that extremely unbalanced frequency distributions pose for rule-based or data-driven models are often underestimated or even unrecognized. This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design. Currently available techniques for coping with the problem of rare or unseen events in each of these components are reviewed. Finally, a distinction is made between a strictly closed domain with a fixed vocabulary and a merely restricted domain with loopholes for unseen words and names, and the consequences of the respective type of domain for appropriate synthesis strategies are discussed.
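
The skew described in the abstract can be made concrete with a small count. The following is a minimal sketch, not taken from the paper: the helper name `rare_event_profile` and the toy sentence are assumptions made for illustration. It reports how many types occur exactly once and applies the standard Good-Turing estimate, under which the probability mass of unseen types is approximately the number of types seen once divided by the number of tokens.

```python
from collections import Counter


def rare_event_profile(tokens):
    """Summarise how skewed a frequency distribution is (illustrative helper)."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        raise ValueError("need at least one token")
    n_types = len(counts)
    # Types observed exactly once (hapax legomena).
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Share of the observed vocabulary that was seen only once.
        "hapax_share_of_types": hapaxes / n_types,
        # Good-Turing estimate of the total probability of unseen types: N1 / N.
        "unseen_mass_good_turing": hapaxes / n_tokens,
    }


# Toy data, assumed for illustration only.
sample = "the cat sat on the mat and the dog sat on the log".split()
profile = rare_event_profile(sample)
for key, value in profile.items():
    print(key, value)
```

Even in this tiny sample most types are singletons, so a sizeable share of probability mass is assigned to events the sample never contains. Real corpora of words, syllables, or synthesis units would need to be substituted for the toy list to say anything about a particular language.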

Author information

Authors and Affiliations

Institute of Natural Language Processing, University of Stuttgart, Azenbergstraße 12, D-70174 Stuttgart, Germany
Bernd Möbius

About this article

Cite this article

Möbius, B. Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis. International Journal of Speech Technology 6, 57–71 (2003). https://doi.org/10.1023/A:1021052023237

Issue Date: January 2003
DOI: https://doi.org/10.1023/A:1021052023237
