ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata

Kopp, Matyáš; Stankov, Vladislav; Krůza, Jan Oldřich; Straňák, Pavel; Bojar, Ondřej

doi:10.1007/978-3-030-83527-9_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

We present ParCzech 3.0, a speech corpus of Czech parliamentary speeches delivered in the Czech Chamber of Deputies between 25 November 2013 and 1 April 2021.

In contrast to previous speech corpora of Czech, we preserve not just the orthography but also all the available metadata (speaker identities, gender, web page links, affiliations, committees, political groups, etc.) and complement it with automatic morphological and syntactic annotation and named entity recognition. The corpus is encoded in the TEI format, which allows for straightforward and versatile exploitation.

The rather rich metadata and annotation make the corpus relevant for a wide audience of researchers ranging from engineers in the speech community to theoretical linguists studying rhetorical patterns at scale.
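
As a rough illustration of what TEI encoding makes possible, the sketch below reads speaker-attributed utterances from a TEI fragment using only the Python standard library. The fragment, the speaker identifiers and the helper `utterances` are hypothetical and written for this example; it assumes speeches are stored as TEI `<u>` elements with a `who` attribute and `<seg>` children, which may not match the actual ParCzech files in every detail.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the prefix "tei" is just a local label.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment written for this example, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="debateSection">
        <u who="#SpeakerA" xml:id="u1">
          <seg xml:id="u1.p1">Dobrý den.</seg>
        </u>
        <u who="#SpeakerB" xml:id="u2">
          <seg xml:id="u2.p1">Děkuji.</seg>
          <seg xml:id="u2.p2">Pokračujeme.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def utterances(xml_text):
    """Return a list of (speaker_id, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # The who attribute is a pointer such as "#SpeakerA"; drop the "#".
        speaker = u.get("who", "").lstrip("#")
        # Join the text of all direct <seg> children into one string.
        segments = ("".join(seg.itertext()).strip()
                    for seg in u.iterfind("tei:seg", TEI_NS))
        result.append((speaker, " ".join(segments)))
    return result


if __name__ == "__main__":
    for speaker, text in utterances(SAMPLE):
        print(f"{speaker}: {text}")
```

In a real file, the speaker pointer would presumably resolve to a person record carrying the metadata listed above (gender, affiliations and so on); that lookup is left out here because its exact layout is not described in this text.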

Notes

1. https://www.clarin.eu/event/2020/clarin-cafe-join-our-parliamentary-flavoured-coffee-parlamint.
2. https://www.psp.cz/eknih/2002ps/audio/2006/01/17/index.htm.
3. https://psp.cz/eknih/2013ps/stenprot/zip/index.htm.
4. https://psp.cz/eknih/2013ps/stenprot/index.htm.
5. E.g. https://psp.cz/eknih/2013ps/audio/index.htm.
6. https://psp.cz/sqw/hp.sqw?k=1300.
7. Paragraphs are made by stenographers and can be revised by the speaker.

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR) and the grant 19-26934X (NEUREM3) of the Czech Science Foundation, and Project No. LM2018101 LINDAT/CLARIAH-CZ of the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Authors and Affiliations

Charles University, Faculty of Mathematics and Physics, ÚFAL, Malostranské nám. 25, Praha 1, 11800, Prague, Czech Republic
Matyáš Kopp, Vladislav Stankov, Jan Oldřich Krůza, Pavel Straňák & Ondřej Bojar

Corresponding author

Correspondence to Ondřej Bojar.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík

About this paper

Cite this paper

Kopp, M., Stankov, V., Krůza, J.O., Straňák, P., Bojar, O. (2021). ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_25

DOI: https://doi.org/10.1007/978-3-030-83527-9_25
Published: 30 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)