Abstract
In concatenative text-to-speech (TTS) synthesis, the size of the speech corpus is closely related to the quality of the synthetic speech. In this paper, we describe our work on a new corpus-based Bell Labs TTS system. The system encompasses large acoustic inventories with a rich set of annotations, models and data structures for representing and managing such inventories, and an optimal unit selection algorithm that accommodates a broad range of possible cost criteria. We also propose a new method for setting the weights in the cost functions based on a perceptual preference test. Our results show that this approach can successfully predict human preference patterns. Synthetic speech using weights determined in this manner consistently demonstrates smoother transitions and higher voice quality than speech using manually set weights.
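To make the idea of an optimal unit search over weighted cost criteria concrete, the sketch below shows a generic dynamic-programming (Viterbi-style) search. It is an illustration under stated assumptions, not the algorithm of the system described here: the function name `select_units`, the two-term cost (one target cost, one join cost), the weights `w_target` and `w_join`, and the toy pitch values are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification type (hypothetical)
U = TypeVar("U")  # candidate unit type (hypothetical)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> Tuple[float, List[U]]:
    """Return (total cost, unit sequence) minimizing the weighted sum of
    target costs and join costs, one unit per target position."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate")

    # best[j]: cheapest cost of any path ending in candidates[i][j]
    best = [w_target * target_cost(targets[0], u) for u in candidates[0]]
    backpointers: List[List[int]] = []

    for i in range(1, len(targets)):
        prev_units = candidates[i - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for u in candidates[i]:
            # Cheapest predecessor for unit u, including the join cost.
            costs = [best[k] + w_join * join_cost(p, u)
                     for k, p in enumerate(prev_units)]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + w_target * target_cost(targets[i], u))
            pointers.append(k_min)
        best = new_best
        backpointers.append(pointers)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [candidates[-1][j]]
    for i in range(len(backpointers) - 1, -1, -1):
        j = backpointers[i][j]
        path.append(candidates[i][j])
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Toy data: targets and units are pitch values in Hz (made up).
    toy_targets = [100, 110, 120]
    toy_candidates = [[98, 105], [100, 130], [118, 140]]
    cost, units = select_units(
        toy_targets,
        toy_candidates,
        target_cost=lambda t, u: abs(t - u),
        join_cost=lambda a, b: abs(a - b),
    )
    print(cost, units)
```

In a sketch like this, changing `w_target` and `w_join` shifts the balance between matching each target and keeping transitions smooth, which is why the choice of weights affects the selected units.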
Cite this article
Lee, M., Lopresti, D.P. & Olive, J.P. A Text-to-Speech Platform for Variable Length Optimal Unit Searching Using Perception Based Cost Functions. International Journal of Speech Technology 6, 347–356 (2003). https://doi.org/10.1023/A:1025752731945