Multi-level Annotation in SpeeCon Polish Speech Database

Marasek, Krzysztof; Gubrynowicz, Ryszard

doi:10.1007/11558637_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3490)

Included in the following conference series:

Intelligent Media Technology for Communicative Intelligence

Abstract

The SpeeCon Polish Speech Database was collected within the SpeeCon project, which was partially sponsored by the EC (IST-1999-10003). The database contains two sets of data: 550 recording sessions from adults and 50 sessions from children. The adult speakers were recorded in various environments: offices, living rooms, cars and public places. The recordings contain free spontaneous speech, elicited spontaneous speech, phonetically compact words and sentences, general-purpose words and phrases, and application-specific words and utterances. One of the most important problems in constructing the database is defining the basis for a multi-level transcription composed of several tiers. The tiers can be grouped into three classes: linguistic, symbolic and physical representation. Orthographic transcription is applied to the sentence, phrase and word tiers; symbolic transcription related to grammar and articulation is applied to the part-of-speech, phoneme and syllable tiers; and mnemonics are used to describe some characteristics of the measurable physical data. The paper presents the rules applied to text, speech and noise transcription and remarks on the pronunciation variants found in the database. The final part of the paper discusses the creation of the lexicon, an alphabetically ordered list of the distinct lexical items occurring in the recorded corpus. The Polish lexicon was built by various methods, including hand annotation and rule-based generation followed by a manual check.
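
The tier grouping and the lexicon definition given in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tooling: the names `TIER_CLASSES`, `Segment`, `Annotation`, `tier_class` and `build_lexicon` are hypothetical, the physical tier name `signal_events` is a placeholder because the abstract does not name that tier, and `sorted()` orders by Unicode code point, which only approximates Polish alphabetical order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Three tier classes and their tiers, as listed in the abstract.
# "signal_events" is a placeholder name (assumption), not from the paper.
TIER_CLASSES = {
    "linguistic": ("sentence", "phrase", "word"),
    "symbolic": ("part_of_speech", "phoneme", "syllable"),
    "physical": ("signal_events",),
}


def tier_class(tier: str) -> Optional[str]:
    """Return the class a tier belongs to, or None if the tier is unknown."""
    for cls, names in TIER_CLASSES.items():
        if tier in names:
            return cls
    return None


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    label: str    # orthographic text, symbol, or mnemonic


@dataclass
class Annotation:
    """Multi-level annotation of one recording: tier name -> segments."""
    tiers: Dict[str, List[Segment]] = field(default_factory=dict)

    def add(self, tier: str, start: float, end: float, label: str) -> None:
        if tier_class(tier) is None:
            raise ValueError(f"unknown tier: {tier}")
        self.tiers.setdefault(tier, []).append(Segment(start, end, label))


def build_lexicon(annotations: List[Annotation]) -> List[str]:
    """Collect the distinct word-tier labels and return them sorted.

    Sorting is by code point (assumption); true Polish collation would
    need a locale-aware sort key.
    """
    items = {
        seg.label
        for ann in annotations
        for seg in ann.tiers.get("word", [])
    }
    return sorted(items)


if __name__ == "__main__":
    ann = Annotation()
    ann.add("sentence", 0.0, 0.9, "dzień dobry")
    ann.add("word", 0.0, 0.4, "dzień")
    ann.add("word", 0.4, 0.9, "dobry")
    ann.add("word", 1.2, 1.7, "dobry")  # repeated item, listed once
    print(build_lexicon([ann]))
```

Keeping every tier as a list of time-stamped segments lets the orthographic, symbolic and physical levels share one structure, so a lexicon can be derived from the word tier without touching the others.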



Author information

Authors and Affiliations

Polish-Japanese Institute of Information Technology, ul. Koszykowa 86, 02-008, Warsaw, Poland
Krzysztof Marasek
Institute of Fundamental Technological Research PAS, ul. Świętokrzyska 21, 00-049, Warsaw, Poland
Ryszard Gubrynowicz

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Science, Ordona 21, 01-237, Warsaw, Poland
Leonard Bolc
School of Computer Science, University of Adelaide, 5005, Adelaide, SA, Australia
Zbigniew Michalewicz
Dept. of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Japan
Toyoaki Nishida

Cite this paper

Marasek, K., Gubrynowicz, R. (2005). Multi-level Annotation in SpeeCon Polish Speech Database. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds) Intelligent Media Technology for Communicative Intelligence. IMTCI 2004. Lecture Notes in Computer Science, vol 3490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11558637_7

DOI: https://doi.org/10.1007/11558637_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29035-3
Online ISBN: 978-3-540-31738-8
eBook Packages: Computer Science, Computer Science (R0)
