Abstract
One of the most serious challenges for speech synthesis is the systematic treatment of events in language and speech that are known to have low frequencies of occurrence. The problems that extremely unbalanced frequency distributions pose for rule-based or data-driven models are often underestimated or even unrecognized. This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design. Currently available techniques for coping with the problem of rare or unseen events in each of these components are reviewed. Finally, a distinction is made between a strictly closed domain with a fixed vocabulary and a merely restricted domain with loopholes for unseen words and names, and the consequences of the respective type of domain for appropriate synthesis strategies are discussed.
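As a rough illustration of why rare events matter, the sketch below counts how many word types occur exactly once in a sample and uses that count to estimate the probability that the next token is a type not seen so far. It applies the standard Good-Turing estimate (number of types seen once, divided by the number of tokens), chosen here as one common technique and not necessarily the one the paper favors. The function names and the toy sentence are hypothetical and do not come from the paper.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency r to the number of types that occur exactly r times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types: N1 / N.

    N1 is the number of types seen exactly once (hapax legomena) and
    N is the total number of tokens in the sample.
    """
    type_counts = Counter(tokens)
    n_tokens = sum(type_counts.values())
    if n_tokens == 0:
        raise ValueError("cannot estimate from an empty sample")
    n_hapax = sum(1 for count in type_counts.values() if count == 1)
    return n_hapax / n_tokens


# Toy sample (an assumption for illustration, not data from the paper).
toy = "the cat sat on the mat and the dog sat by the door".split()

spectrum = frequency_spectrum(toy)
unseen = good_turing_unseen_mass(toy)
print(dict(spectrum))
print(round(unseen, 3))
```

Even in this tiny sample most types occur only once, so the estimated share of probability reserved for unseen words is large. On a realistic corpus the estimate is smaller, but it rarely falls to zero, which is the situation the abstract describes for an open lexicon.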
Cite this article
Möbius, B. Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis. International Journal of Speech Technology 6, 57–71 (2003). https://doi.org/10.1023/A:1021052023237
DOI: https://doi.org/10.1023/A:1021052023237