Polish unit selection speech synthesis with BOSS: extensions and speech corpora

Published in: International Journal of Speech Technology

Abstract

This article presents research and development work aimed at creating a Polish speech database for speech synthesis and at adapting BOSS (the Bonn Open Synthesis System) to Polish. First, the linguistic background for the design of Polish spoken resources for unit selection is presented, together with the transcription and annotation methods applied. The next section details the assumptions behind the Polish corpus, its structure, and its segmental and prosodic annotation. Then two Polish modules in BOSS are described, along with the linguistic features they use for duration modelling and for the selection of suitable speech units: the duration prediction module (accompanied by a concise overview of Polish duration modelling for speech technology purposes) and the cost functions module. Finally, the results of two kinds of perception tests are discussed. The first is a preference test evaluating synthesized speech obtained with three variants of speech signal segmentation (automatic, semi-automatic and manual). The second is a mean opinion score test carried out to provide a preliminary assessment of the speech quality attained with the Polish version of the BOSS synthesizer. The closing section summarizes future perspectives and challenges for Polish TTS (text-to-speech) and further developments of BOSS for Polish.

Corresponding author

Correspondence to Stefan Breuer.

About this article

Cite this article

Demenko, G., Klessa, K., Szymański, M. et al. Polish unit selection speech synthesis with BOSS: extensions and speech corpora. Int J Speech Technol 13, 85–99 (2010). https://doi.org/10.1007/s10772-010-9071-3

  • DOI: https://doi.org/10.1007/s10772-010-9071-3
