Using Low-Cost Annotation to Train a Reliable Czech Shallow Parser

  • Conference paper
Text, Speech, and Dialogue (TSD 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8082)


Abstract

Bushbank is a relatively new concept: a type of annotated corpus in which annotation is driven by automatic tools, and the task of human annotators is limited to accepting or rejecting parts of the tools' output. This makes it possible to obtain annotated corpora of considerable size at relatively low cost.

In this paper we ask whether the Czech Bushbank is reliable enough to be used for an NLP task in place of a traditional corpus annotated with high rigour. We evaluate three different parsers using its shallow syntactic annotation, including a CRF chunker originally built for Polish. The results are very promising and suggest that many practical applications could benefit from low-cost annotation.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Radziszewski, A., Grác, M. (2013). Using Low-Cost Annotation to Train a Reliable Czech Shallow Parser. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_72

  • DOI: https://doi.org/10.1007/978-3-642-40585-3_72

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40584-6

  • Online ISBN: 978-3-642-40585-3

  • eBook Packages: Computer Science, Computer Science (R0)
