Annotation of sentence structure

Capturing the relationship between clauses in Czech sentences

  • Original paper
  • Language Resources and Evaluation

Abstract

The focus of this article is the creation of a collection of sentences manually annotated with respect to their sentence structure. We show that the concept of linear segments (linguistically motivated units that can easily be detected automatically) serves as a good basis for the identification of clauses in Czech. The segment annotation captures relationships such as subordination, coordination, apposition and parenthesis; based on segmentation charts, the individual clauses forming a complex sentence are identified. The annotation of sentence structure enriches a dependency-based framework with explicit syntactic information on relations among complex units such as clauses. We have gathered a collection of 3,444 sentences from the Prague Dependency Treebank and annotated them with respect to their sentence structure; these sentences comprise 10,746 segments forming 6,341 clauses. The main purpose of the project is to obtain development data: promising results have already been reported for Czech NLP tools (such as a dependency parser or a machine translation system for related languages) that adopt the idea of clause segmentation. The collection of sentences with annotated sentence structure makes further improvement of such tools possible.
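The idea that segments can be detected automatically can be illustrated with a minimal sketch. It assumes that segment boundaries are punctuation marks and coordinating conjunctions; the boundary sets, the tokenizer, and the function names below are illustrative assumptions, not the inventory or the procedure used by the authors. The example sentence is the one given in Note 4.

```python
import re

# Assumed boundary inventory (illustrative only, not the article's list).
PUNCTUATION = {",", ";", ":", ".", "?", "!", "(", ")", '"', "„", "“"}
COORD_CONJUNCTIONS = {"a", "i", "ale", "nebo", "ani"}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def split_into_segments(sentence):
    """Return a list of (kind, tokens) pairs, kind is 'segment' or 'boundary'.

    A segment is a maximal run of tokens between two boundaries.
    """
    parts = []
    current = []
    for token in tokenize(sentence):
        if token in PUNCTUATION or token.lower() in COORD_CONJUNCTIONS:
            if current:
                parts.append(("segment", current))
                current = []
            parts.append(("boundary", [token]))
        else:
            current.append(token)
    if current:
        parts.append(("segment", current))
    return parts


if __name__ == "__main__":
    for kind, tokens in split_into_segments("Řekla, že přijde."):
        print(kind, " ".join(tokens))
```

Under these assumptions the sentence yields two segments, "Řekla" and "že přijde", separated by a comma boundary. Assigning levels of embedding to the segments and grouping them into clauses is the part that the manual annotation addresses and is not attempted here.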

Notes

  1. We adopt the basic idea of segments introduced and used by Kuboň (2001) and Kuboň et al. (2007). We slightly modify it for the purposes of the annotation task.

  2. http://ufal.mff.cuni.cz/pdt2.0/.

  3. E.g., in the experiments reported by Lopatková and Holan (2009), the correct level of embedding was assigned to only approx. 75% of segments.

  4. In Czech, the subordinated clause representing the object must be separated by a comma and introduced by a subordinating conjunction, as in Řekla, že přijde.

  5. We consider main clauses to be those clauses that are syntactically/formally independent; see also Section 3.

  6. This decision enables us to speed up the annotation and to avoid undesired overlapping or repeated annotation: the analytical layer of the PDT already contains information on syntactic functions (such as predicate, subject, object, nominal predicate, attribute, or adverbial); detailed semantic classification pertains to the tectogrammatical layer of the PDT.

  7. Quotation marks marking direct speech have to be combined with another boundary in Czech, primarily with a comma. This rule makes it possible to reliably distinguish direct speech from cases in which quotation marks are used for other purposes, e.g., to emphasize individual words; the latter type gets the same level of embedding as its neighbors.

  8. In the PDT, a coordination of sentence members and a coordination of clauses are not distinguished (at the analytical layer).

  9. The reason for this decision lies in the verb-centric character of dependency syntax traditionally used for Czech.

  10. At the a-layer, the ellipsis of a predicate is marked by a special analytical function; at the t-layer, ellipsis is restored (as a node of a tree).

  11. We have focused on the sentences from the data/full/amw/train2 portion of the PDT data, i.e., one of the eight directories of standard PDT training data annotated on both the m- and a-layers; the number of annotated sentences is approximately the same as the number of sentences in the development data set from this portion of the PDT.

References

  • Abney, S. P. (1991). Parsing by chunks. In R. Berwick, S. Abney, & C. Tenny (Eds.). Principle-based parsing (pp. 257–278). Dordrecht: Kluwer Academic Publishers.

  • Abney, S. P. (1995). Partial parsing via finite-state cascades. Journal of Natural Language Engineering 2(4), 337–344.

  • Ciravegna, F., & Lavelli, A. (1999). Full text parsing using cascades of rules: An information extraction procedure. In Proceedings of EACL’99 (pp. 102–109). University of Bergen, Bergen.

  • Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech). Prague: Karolinum Press.

  • Hajič, J., Panevová, J., Buráňová, E., Urešová, Z., Bémová, A., Štěpánek, J., et al. (2004). Anotace na analytické rovině. Návod pro anotátory. UFAL/CKL technical report no. 2004/TR-2004-23, ÚFAL/CKL MFF UK.

  • Hajič, J., Hajičová, E., Panevová, J., Sgall, P., Pajas, P., Štěpánek, J., et al. (2006). Prague dependency treebank 2.0. Philadelphia: Linguistic Data Consortium.

  • Holan, T., & Žabokrtský, Z. (2006). Combining Czech dependency parsers. In Proceedings of TSD 2006 (pp. 95–102). Springer, LNAI, Vol. 4188.

  • Homola, P., & Kuboň, V. (2010). Exploiting charts in the MT between related languages. International Journal of Computational Linguistics and Applications 1(1–2), 185–199.

  • Jones, B. E. M. (1994). Exploiting the role of punctuation in parsing natural text. In Proceedings of COLING’94 (pp. 421–425).

  • Krůza, O., & Kuboň, V. (2009). Automatic extraction of clause relationships from a treebank. In Computational linguistics and intelligent text processing. Proceedings of CICLing 2009 (pp. 195–206). Springer, LNCS, Vol. 5449.

  • Kuboň, V. (2001). Problems of robust parsing of Czech. PhD thesis, Faculty of Mathematics and Physics, Charles University in Prague, Prague.

  • Kuboň, V., Lopatková, M., Plátek, M., & Pognan, P. (2007). A linguistically-based segmentation of complex sentences. In D. Wilson & G. Sutcliffe (Eds.). Proceedings of FLAIRS conference (pp. 368–374). Menlo Park, CA: AAAI Press.

  • Lopatková, M., & Holan, T. (2009). Segmentation charts for Czech: Relations among segments in complex sentences. In A. H. Dediu, A. M. Ionescu, & C. Martín-Vide (Eds.). Proceedings of LATA 2009 (Vol. 5457, pp. 542–553). New York: Springer, LNCS.

  • Lopatková, M., & Kljueva, N. (2010). Anotace segmentů. (Anotanční příručka) (in manuscript).

  • Marinčič, D., Šef, T., & Gams, M. (2010). Intraclausal coordination and clause detection as a preprocessing step to dependency parsing. In V. Matoušek, & P. Mautner (Eds.) Proceedings of TSD 2009 (Vol. 5729, pp. 147–153). Springer, LNAI, New York.

  • Ohno, T., Matsubara, S., Kashioka, H., Maruyama, T., & Inagaki, Y. (2006). Dependency parsing of Japanese spoken monologue based on clause boundaries. In Proceedings of COLING and ACL (pp. 169–176). ACL.

  • Šmilauer, V. (1969). Novočeská skladba (New Czech syntax). PhD thesis, Praha: Státní pedagogické nakladatelství.

  • Zeman, D. (2004). Parsing with a statistical dependency model. PhD thesis, Prague: Charles University in Prague.

Acknowledgments

The article presents the results of a project supported by grant No. 405/08/0681 and partially by grant No. P202/10/1333 of the Grant Agency of the Czech Republic. The authors are also grateful to the anonymous reviewers for their valuable suggestions.

Author information

Corresponding author

Correspondence to Markéta Lopatková.

About this article

Cite this article

Lopatková, M., Homola, P. & Klyueva, N. Annotation of sentence structure. Lang Resources & Evaluation 46, 25–36 (2012). https://doi.org/10.1007/s10579-011-9162-z

  • DOI: https://doi.org/10.1007/s10579-011-9162-z
