Annotation of multiword expressions in the Prague dependency treebank

Language Resources and Evaluation

Abstract

We describe the annotation of multiword expressions (MWEs) in the Prague Dependency Treebank, using several automatic pre-annotation steps. We use subtrees of the treebank's tectogrammatical tree structures to store representations of the MWEs in a dictionary and to pre-annotate subsequent occurrences automatically. We also show a way to measure the reliability of this type of annotation.

Notes

  1. With a few exceptions, such as personal pronouns (which refer to other lexemes) or coordination heads.

  2. Functional Generative Description (FGD; Sgall et al. 1986; Hajičová et al. 1998) is a framework for the systematic description of a language, on which the PDT project is based. In FGD, units of the t-layer are construed as equivalent to monosemic lexemes and are combined into dependency trees based on the syntactic valency of the t-nodes.

  3. NEs can in general also consist of a single word, but in this phase of our project we are interested only in MWEs, so when we say NE in this paper, we always mean a multiword NE.

  4. This is because in PDT-VALLEX valency is not the only criterion for distinguishing frames (= meanings). Two words with the same morphological lemma and valency frame are assigned two different frames if their meanings differ.

  5. Although we created the PML schema of the s-layer primarily for annotations of MWEs, we made it quite generic. It can be used for any treebank annotation that relies on a large lexicon. For instance, one s-file can contain multiple annotations of valency that refer to different valency dictionaries. This generic nature of the s-layer is the reason why it allows references to the morphological, analytical or tectogrammatical layer of PDT, even though in our current project we only need references to the t-layer.

  6. This is exactly what happens: (1) the tree structure of the selected MWE is identified via TrEd; (2) the tree structure is added to the lexeme’s entry in SemLex; (3) all the sentences in the given file are searched for the same MWE using its tree structure (via TrEd); and (4) other occurrences returned by TrEd are tagged with this MWE’s ID, but these occurrences receive an attribute “auto”, which identifies them (both in the s-files and visually in the annotation tool) as annotated automatically.

  7. In our previous work we used a weighted variant of π (which does not reflect individual coders’ distributions), with the same result as \(\kappa_w^U\): \(\pi_w = 0.644\). See Bejček et al. (2008).
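
The four steps in note 6 can be illustrated with a minimal sketch. It is not the TrEd or SemLex implementation: the `Node` class, the `match` and `pre_annotate` functions, the dictionary-based lexicon and the ID `mwe-1` are all hypothetical, and a t-node is reduced here to a lemma plus its dependents. The sketch only shows the idea of storing an MWE as a small tree and tagging every further occurrence of that tree with the MWE's ID and an "auto" flag.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Simplified stand-in for a t-node: a lemma and its dependents (assumption)."""
    lemma: str
    children: list[Node] = field(default_factory=list)


def match(pattern: Node, node: Node) -> Optional[list[Node]]:
    """Return the nodes covered by `pattern` when rooted at `node`, else None.

    Each pattern child must match a different child of `node`. The search is
    greedy (first fitting child wins), which is enough for an illustration.
    """
    if pattern.lemma != node.lemma:
        return None
    covered = [node]
    used: set[int] = set()
    for p_child in pattern.children:
        for i, child in enumerate(node.children):
            if i in used:
                continue
            sub = match(p_child, child)
            if sub is not None:
                used.add(i)
                covered.extend(sub)
                break
        else:
            return None  # this pattern child has no counterpart in the tree
    return covered


def walk(node: Node) -> Iterator[Node]:
    """Yield every node of the tree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def pre_annotate(sentences: list[Node], lexicon: dict[str, Node]) -> list[dict]:
    """Tag every occurrence of a stored MWE tree; mark each tag as automatic."""
    annotations = []
    for s_idx, root in enumerate(sentences):
        for node in walk(root):
            for mwe_id, pattern in lexicon.items():
                covered = match(pattern, node)
                if covered is not None:
                    annotations.append({
                        "sentence": s_idx,
                        "mwe_id": mwe_id,
                        "lemmas": [n.lemma for n in covered],
                        "auto": True,  # mirrors the "auto" attribute of note 6
                    })
    return annotations


# Hypothetical lexicon entry: the MWE tree "treebank -> dependency".
lexicon = {"mwe-1": Node("treebank", [Node("dependency")])}
sentences = [
    Node("describe", [Node("we"),
                      Node("treebank", [Node("Prague"), Node("dependency")])]),
    Node("build", [Node("treebank", [Node("new")])]),
]
for annotation in pre_annotate(sentences, lexicon):
    print(annotation)
```

Matching on tree structure rather than on surface word order is what lets a stored entry find occurrences whose words are not adjacent in the sentence, as long as the dependency relations are the same.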

References

  • Artstein, R., & Poesio, M. (2007). Inter-coder agreement for computational linguistics. Submitted to Computational Linguistics.

  • Bejček, E., Straňák, P., & Schlesinger, P. (2008). Annotation of multiword expressions in the Prague dependency treebank. In IJCNLP 2008 Proceedings of the third international joint conference on natural language processing (pp. 793–798).

  • Čermák, F., Červená, V., Churavý, M., & Machač, J. (1994). Slovník české frazeologie a idiomatiky. Praha: Academia.

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

  • Eurovoc. (2007). http://europa.eu/eurovoc/.

  • Hajič, J. (2005). Complex corpus annotation: The Prague dependency treebank. In Insight into Slovak and Czech corpus linguistics (pp. 54–73). Bratislava, Slovakia: Veda.

  • Hajič, J., Holub, M., Hučinová, M., Pavlík, M., Pecina, P., Straňák, P., et al. (2004). Validating and improving the Czech WordNet via lexico-semantic annotation of the Prague dependency treebank. In LREC 2004, Lisbon.

  • Hajič, J., Panevová, J., Urešová, Z., Bémová, A., Kolářová, V., & Pajas, P. (2003). PDT-VALLEX. In J. Nivre & E. Hinrichs (Eds.), Proceedings of the second workshop on treebanks and linguistic theories, Vol. 9 of Mathematical modeling in physics, engineering and cognitive sciences (pp. 57–68). Vaxjo, Sweden: Vaxjo University Press.

  • Hajičová, E., Partee, B. H., & Sgall, P. (1998). Topic-focus articulation, tripartite structures, and semantic content, Vol. 71 of Studies in linguistics and philosophy. Dordrecht: Kluwer.

  • Hnátková, M. (2002). Značkování frazémů a idiomů v Českém národním korpusu s pomocí Slovníku české frazeologie a idiomatiky. Slovo a slovesnost.

  • Kilgarriff, A. (1998). SENSEVAL: An exercise in evaluating word sense disambiguation programs. In Proceedings of LREC (pp. 581–588). Granada.

  • Krenn, B., & Erbach, G. (1993). Idioms and Support Verb Constructions in HPSG. Technical report, Universität des Saarlandes, Saarbrücken.

  • Mel'čuk, I. (1996). Lexical functions: A tool for the description of lexical relations in a lexicon. In L. Wanner (Ed.) Lexical functions in lexicography and natural language processing, Vol. 31 of Studies in language companion series (pp. 37–102). Amsterdam: John Benjamins.

  • Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zielinska, V., Young, B., et al. (2004). The NomBank project: An interim report. In A. Meyers (Ed.), HLT-NAACL 2004 workshop: Frontiers in corpus annotation (pp. 24–31). Boston, MA, USA: Association for Computational Linguistics.

  • Mihalcea, R. (1998). SEMCOR Semantically tagged corpus.

  • Mikulová, M., Bémová, A., Hajič, J., Hajičová, E., Havelka, J., Kolářová, V., et al. (2006). Annotation on the Tectogrammatical Level in the Prague Dependency Treebank Annotation manual. Technical Report 30, ÚFAL MFF UK, Prague, Czech Rep.

  • Pajas, P. (2007). TrEd. http://ufal.mff.cuni.cz/~pajas/tred/index.html.

  • Pajas, P., & Štěpánek, J. (2005). A Generic XML-Based Format for Structured Linguistic Annotation and Its Application to Prague Dependency Treebank 2.0. Technical Report TR-2005-29, ÚFAL MFF UK, Prague, Czech Rep.

  • Palmer, M., Gildea, D., & Kingsbury, P. (2005). The proposition bank: A corpus annotated with semantic roles. Computational Linguistics Journal, 31(1).

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Third international conference, CICLing.

  • Ševčíková, M., Žabokrtský, Z., & Krůza, O. (2007). Zpracování pojmenovaných entit v českých textech (Treatment of Named Entities in Czech Texts). Technical Report TR-2007-36, ÚFAL MFF UK, Prague, Czech Republic.

  • Sgall, P., Hajičová, E., & Panevová, J. (1986). The meaning of the sentence in its semantic and pragmatic aspects. Praha/Dordrecht: Academia/Reidel Publishing Company.

  • Smrž, P. (2003). Quality control for wordnet development. In P. Sojka, K. Pala, P. Smrž, C. Fellbaum, & P. Vossen (Eds.), Proceedings of the second international WordNet conference—GWC 2004 (pp. 206–212). Masaryk University Brno: Brno, Czech Republic.

Acknowledgements

This work has been supported by grants 1ET201120505 and 1ET100300517 of the Grant Agency of the Academy of Sciences of the Czech Republic, projects MSM0021620838 and LC536 of the Ministry of Education, grant 201/05/H014 of the Czech Science Foundation, and grant GAUK 4307/2009 of the Grant Agency of Charles University in Prague.

Author information

Corresponding authors

Correspondence to Eduard Bejček or Pavel Straňák.

About this article

Cite this article

Bejček, E., Straňák, P. Annotation of multiword expressions in the Prague dependency treebank. Lang Resources & Evaluation 44, 7–21 (2010). https://doi.org/10.1007/s10579-009-9093-0

  • DOI: https://doi.org/10.1007/s10579-009-9093-0
