Abstract
This article presents research and development aimed at creating a Polish speech database for speech synthesis and adapting BOSS (the Bonn Open Synthesis System) to the Polish language. First, it presents the linguistic background for the design of Polish spoken resources for unit selection, together with the transcription and annotation methods applied. The next section details the assumptions and structure of the Polish corpus and its segmental and prosodic annotation. The article then reports the linguistic features used in duration modelling and in the selection of suitable speech units in the two Polish modules of BOSS: the duration prediction module (described together with a concise overview of Polish duration modelling for speech technology purposes) and the cost functions module. Finally, the results of two perception tests are discussed: a preference test evaluating synthesized speech obtained with three variants of speech signal segmentation (automatic, semi-automatic and manual), and a mean opinion score test providing a preliminary assessment of the speech quality attained with the Polish version of the BOSS synthesizer. The closing section summarizes future perspectives and challenges for Polish TTS (text-to-speech) and further developments of BOSS for Polish.
Cite this article
Demenko, G., Klessa, K., Szymański, M. et al. Polish unit selection speech synthesis with BOSS: extensions and speech corpora. Int J Speech Technol 13, 85–99 (2010). https://doi.org/10.1007/s10772-010-9071-3
DOI: https://doi.org/10.1007/s10772-010-9071-3