Abstract
The SpeeCon Polish Speech Database was collected within the framework of the SpeeCon project, partially sponsored by the EC (IST-1999-10003). The database contains two sets of data: 550 recording sessions with adult speakers and 50 sessions with children. The adult speakers were recorded in various environments: offices, living rooms, cars and public places. The recordings contain free spontaneous speech passages, elicited spontaneous speech, phonetically compact words and sentences, general-purpose words and phrases, and application-specific words and utterances. One of the most important problems in constructing the database is defining the basis for a multi-level transcription composed of several tiers. The tiers can be grouped into three classes: linguistic, symbolic and physical representation. Orthographic transcription is applied to the sentence, phrase and word tiers; symbolic transcription related to grammar and articulation is applied to the part-of-speech, phoneme and syllable tiers; and mnemonics are used to describe some characteristics of the measurable physical data. The paper presents the rules applied to text, speech and noise transcription and remarks on pronunciation variants found in the database. The final part of the paper discusses the creation of the lexicon, an alphabetically ordered list of the distinct lexical items occurring in the recorded corpus. The Polish lexicon was built by various methods, including hand annotation and rule-based generation with a subsequent manual check.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marasek, K., Gubrynowicz, R. (2005). Multi-level Annotation in SpeeCon Polish Speech Database. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds) Intelligent Media Technology for Communicative Intelligence. IMTCI 2004. Lecture Notes in Computer Science, vol 3490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11558637_7
DOI: https://doi.org/10.1007/11558637_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29035-3
Online ISBN: 978-3-540-31738-8
eBook Packages: Computer Science, Computer Science (R0)