
Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis

International Journal of Speech Technology

Abstract

One of the most serious challenges for speech synthesis is the systematic treatment of events in language and speech that are known to have low frequencies of occurrence. The problems that extremely unbalanced frequency distributions pose for rule-based or data-driven models are often underestimated or even unrecognized. This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design. Currently available techniques for coping with the problem of rare or unseen events in each of these components are reviewed. Finally, a distinction is made between a strictly closed domain with a fixed vocabulary and a merely restricted domain with loopholes for unseen words and names, and the consequences of the respective type of domain for appropriate synthesis strategies are discussed.
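As a rough illustration of why skewed frequency distributions matter, the sketch below counts how many word types in a sample occur exactly once and derives the Good-Turing estimate of the probability mass of unseen events (the number of once-seen types divided by the number of tokens). This is one standard technique for sparse data and is not claimed to be the specific method applied in the paper; the function name `rare_event_stats` and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter


def rare_event_stats(tokens):
    """Return (hapax_share, unseen_mass) for a non-empty list of tokens.

    hapax_share: fraction of distinct types that occur exactly once.
    unseen_mass: Good-Turing estimate of the total probability of
                 events not observed in the sample (N1 / N).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    n_tokens = sum(counts.values())   # N: sample size in tokens
    n_types = len(counts)             # number of distinct types
    n_hapax = sum(1 for c in counts.values() if c == 1)  # N1
    return n_hapax / n_types, n_hapax / n_tokens


# Toy sample (hypothetical data, far too small for real estimates).
tokens = "the cat saw the dog and the dog saw a bird".split()
hapax_share, unseen_mass = rare_event_stats(tokens)
print(f"share of types seen once: {hapax_share:.2f}")
print(f"estimated unseen probability mass: {unseen_mass:.2f}")
```

Even in this tiny sample, more than half of the types occur only once. In realistic corpora of words, syllables, or speech units the same pattern holds at scale, which is why a model or inventory built only from observed events will regularly encounter events it has never seen.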




Cite this article

Möbius, B. Rare Events and Closed Domains: Two Delicate Concepts in Speech Synthesis. International Journal of Speech Technology 6, 57–71 (2003). https://doi.org/10.1023/A:1021052023237


  • DOI: https://doi.org/10.1023/A:1021052023237
