ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Abstract

We present ParCzech 3.0, a speech corpus of Czech parliamentary speeches delivered in the Chamber of Deputies of the Czech Republic between 25 November 2013 and 1 April 2021.

Unlike previous speech corpora of Czech, we preserve not only the orthography but also all available metadata (speaker identities, gender, web page links, committee affiliations, political groups, etc.) and complement it with automatic morphological and syntactic annotation and named entity recognition. The corpus is encoded in the TEI format, which allows for straightforward and versatile exploitation.

The rich metadata and annotation make the corpus relevant to a wide audience of researchers, ranging from engineers in the speech community to theoretical linguists studying rhetorical patterns at scale.
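As a rough illustration of how speaker metadata in a TEI-encoded corpus can be consumed, the sketch below pairs each utterance with its speaker reference. The XML fragment, the speaker identifiers and the function name `utterances` are hypothetical and only follow the general TEI convention of `<u who="...">` elements containing `<seg>` children; the actual ParCzech 3.0 files may be organised differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written fragment; NOT taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA" xml:id="u1"><seg>Dobrý den.</seg><seg>Děkuji za slovo.</seg></u>
    <u who="#SpeakerB" xml:id="u2"><seg>Prosím.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Return (speaker_id, text) pairs for every <u> element in a TEI string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # The @who attribute is a local reference of the form "#id".
        speaker = u.get("who", "").lstrip("#")
        # Join the text of the direct <seg> children of the utterance.
        text = " ".join(
            seg.text.strip() for seg in u.iterfind("tei:seg", TEI_NS) if seg.text
        )
        pairs.append((speaker, text))
    return pairs


if __name__ == "__main__":
    for speaker, text in utterances(SAMPLE):
        print(speaker, "|", text)
```

In a real TEI corpus the speaker reference would typically be resolved against a person list in the header to obtain gender, affiliation and similar attributes; that lookup is omitted here because the layout of that list is not described in this abstract.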

Notes

  1. https://www.clarin.eu/event/2020/clarin-cafe-join-our-parliamentary-flavoured-coffee-parlamint

  2. https://www.psp.cz/eknih/2002ps/audio/2006/01/17/index.htm

  3. https://psp.cz/eknih/2013ps/stenprot/zip/index.htm

  4. https://psp.cz/eknih/2013ps/stenprot/index.htm

  5. e.g. https://psp.cz/eknih/2013ps/audio/index.htm

  6. https://psp.cz/sqw/hp.sqw?k=1300

  7. Paragraphs are created by stenographers and can be revised by the speaker.

References

  1. Erjavec, T., Pančur, A.: Parla-CLARIN: TEI guidelines for corpora of parliamentary proceedings, September 2019. https://doi.org/10.5281/zenodo.3446164

  2. Hladká, B., Kopp, M., Straňák, P.: Compiling Czech parliamentary stenographic protocols into a corpus. In: Proceedings of the LREC 2020 Workshop on Creating, Using and Linking of Parliamentary Corpora with Other Types of Political Discourse (ParlaCLARIN II), pp. 18–22. ELRA, Paris (2020)

  3. Hladká, B., Kopp, M., Straňák, P.: ParCzech PS7 1.0 (2020). http://hdl.handle.net/11234/1-3174, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University

  4. Hladká, B., Kopp, M., Straňák, P.: ParCzech PS7 2.0 (2020). http://hdl.handle.net/11234/1-3436, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University

  5. Jakubíček, M., Kovář, V.: CzechParl: corpus of stenographic protocols from Czech Parliament. In: RASLAN 2010, pp. 41–46 (2010). http://nlp.fi.muni.cz/raslan/2010/paper11.pdf

  6. Janssen, M.: TEITOK: text-faithful annotated corpora. In: Proceedings of LREC 2016, pp. 4037–4043 (2016)

  7. Kratochvíl, J., Polak, P., Bojar, O.: Large corpus of Czech parliament plenary hearings. In: Proceedings of LREC 2020, pp. 6363–6367. ELRA (2020). https://www.aclweb.org/anthology/2020.lrec-1.781/

  8. Krůza, J.O.: Czech parliament meeting recordings as ASR training data. In: Proceedings of the 2020 FCCSIS. Annals of Computer Science and Information Systems, vol. 21, pp. 185–188. IEEE (2020). https://doi.org/10.15439/2020F119

  9. Pražák, A., Šmídl, L.: Czech parliament meetings (2012). http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4, LINDAT/CLARIAH-CZ digital library at ÚFAL, Faculty of Mathematics and Physics, Charles University

  10. Roukos, S., Graff, D., Melamed, D.: Hansard French/English LDC95T20 (1995). https://doi.org/10.35111/jhgn-rv21

  11. Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 ST: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 197–207. ACL (2018). https://doi.org/10.18653/v1/K18-2020

  12. Straková, J., Straka, M., Hajič, J.: Neural architectures for nested NER through linearization. In: Proceedings of ACL, pp. 5326–5331 (2019)

  13. Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of ACL System Demonstrations, pp. 13–18, June 2014. https://doi.org/10.3115/v1/P14-5003

  14. TEI Consortium: TEI P5: Guidelines for Electronic Text Encoding and Interchange. 4.2.1., 1 March 2021. TEI Consortium. http://www.tei-c.org/Guidelines/P5/

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 825460 (ELITR) and the grant 19-26934X (NEUREM3) of the Czech Science Foundation, and Project No. LM2018101 LINDAT/CLARIAH-CZ of the Ministry of Education, Youth and Sports of the Czech Republic.

Author information

Corresponding author

Correspondence to Ondřej Bojar.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kopp, M., Stankov, V., Krůza, J.O., Straňák, P., Bojar, O. (2021). ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_25

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)
