Abstract
We present ParCzech 3.0, a speech corpus of Czech parliamentary speeches delivered in the Czech Chamber of Deputies between 25 November 2013 and 1 April 2021.
Unlike previous speech corpora of Czech, we preserve not just the orthography but also all available metadata (speaker identities, gender, web page links, affiliations, committees, political groups, etc.) and complement it with automatic morphological and syntactic annotation and named entity recognition. The corpus is encoded in the TEI format, which allows for straightforward and versatile exploitation.
The rich metadata and annotation make the corpus relevant for a wide audience of researchers, ranging from engineers in the speech community to theoretical linguists studying rhetorical patterns at scale.
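As an illustration of how a TEI-encoded corpus of this kind might be processed, the following minimal sketch extracts speaker identifiers and utterance text using only the Python standard library. It assumes that utterances are stored in TEI `<u>` elements with a `who` attribute pointing to the speaker, as is common in TEI encodings of spoken and parliamentary data; the sample document, the speaker identifiers, and the helper name `utterances` are hypothetical and are not taken from the corpus itself.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature document; the real corpus files are far richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA" xml:id="u1"><seg>Dobrý den.</seg></u>
    <u who="#SpeakerB" xml:id="u2"><seg>Děkuji.</seg><seg>Na shledanou.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Return a list of (speaker_id, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # The 'who' attribute is assumed to be a local reference like "#SpeakerA".
        speaker = u.get("who", "").lstrip("#")
        # Join all text fragments inside the utterance and normalise whitespace.
        text = " ".join(" ".join(u.itertext()).split())
        result.append((speaker, text))
    return result


for speaker, text in utterances(SAMPLE):
    print(speaker, "|", text)
```

Joining text fragments with a space is a simplification: in a token-level annotated file it would also insert spaces before punctuation, so a real pipeline would need to respect the corpus's own spacing information.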
Notes
Paragraphs are made by stenographers and can be revised by the speaker.
References
Erjavec, T., Pančur, A.: Parla-CLARIN: TEI guidelines for corpora of parliamentary proceedings, September 2019. https://doi.org/10.5281/zenodo.3446164
Hladká, B., Kopp, M., Straňák, P.: Compiling Czech parliamentary stenographic protocols into a corpus. In: Proceedings of the LREC 2020 Workshop on Creating, Using and Linking of Parliamentary Corpora with Other Types of Political Discourse (ParlaCLARIN II), pp. 18–22. ELRA, Paris (2020)
Hladká, B., Kopp, M., Straňák, P.: ParCzech PS7 1.0 (2020). http://hdl.handle.net/11234/1-3174, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University
Hladká, B., Kopp, M., Straňák, P.: ParCzech PS7 2.0 (2020). http://hdl.handle.net/11234/1-3436, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University
Jakubíček, M., Kovář, V.: CzechParl: corpus of stenographic protocols from Czech Parliament. In: RASLAN 2010, pp. 41–46 (2010). http://nlp.fi.muni.cz/raslan/2010/paper11.pdf
Janssen, M.: TEITOK: text-faithful annotated corpora. In: Proceedings of LREC 2016, pp. 4037–4043 (2016)
Kratochvíl, J., Polak, P., Bojar, O.: Large corpus of Czech parliament plenary hearings. In: Proceedings of LREC 2020, pp. 6363–6367. ELRA (2020). https://www.aclweb.org/anthology/2020.lrec-1.781/
Krůza, J.O.: Czech parliament meeting recordings as ASR training data. In: Proceedings of the 2020 FCCSIS. Annals of Computer Science and Information Systems, vol. 21, pp. 185–188. IEEE (2020). https://doi.org/10.15439/2020F119
Pražák, A., Šmídl, L.: Czech parliament meetings (2012). http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University
Roukos, S., Graff, D., Melamed, D.: Hansard French/English LDC95T20 (1995). https://doi.org/10.35111/jhgn-rv21
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 ST: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207. ACL (2018). https://doi.org/10.18653/v1/K18-2020
Straková, J., Straka, M., Hajič, J.: Neural architectures for nested NER through linearization. In: Proceedings of ACL, pp. 5326–5331 (2019)
Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of ACL System Demonstrations, pp. 13–18, June 2014. https://doi.org/10.3115/v1/P14-5003
TEI Consortium: TEI P5: Guidelines for Electronic Text Encoding and Interchange. 4.2.1., 1 March 2021. TEI Consortium. http://www.tei-c.org/Guidelines/P5/
Acknowledgements
This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR) and the grant 19-26934X (NEUREM3) of the Czech Science Foundation, and Project No. LM2018101 LINDAT/CLARIAH-CZ of the Ministry of Education, Youth and Sports of the Czech Republic.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kopp, M., Stankov, V., Krůza, J.O., Straňák, P., Bojar, O. (2021). ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_25
DOI: https://doi.org/10.1007/978-3-030-83527-9_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)