The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations

  • Original Paper
Language Resources and Evaluation

Abstract

This paper presents the Chinese Discourse TreeBank, a corpus annotated with Penn Discourse TreeBank-style discourse relations, each of which takes the form of a predicate with two arguments. We first characterize the syntactic and statistical distributions of Chinese discourse connectives, as well as the role of Chinese punctuation marks in discourse annotation, and then describe how we design our annotation strategy and procedure based on this characterization. The Chinese-specific features of our annotation strategy include annotating explicit and implicit discourse relations in a single pass, defining the argument labels on semantic rather than syntactic grounds, and annotating the semantic type of implicit discourse relations directly. We also introduce a flat, 11-valued semantic type classification scheme for discourse relations. Finally, we demonstrate the feasibility of our approach with evaluation results.
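As a rough illustration of the relation format described in the abstract, the sketch below models a discourse relation as a record with two arguments, a semantic type, and an optional connective. The class name, field names, and placeholder values are hypothetical and are not taken from the corpus or its release format; in particular, the 11 semantic type labels are not listed here, so the type is kept as a free string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscourseRelation:
    """Hypothetical record for one PDTB-style relation (names are assumptions)."""

    arg1: str                          # text span of the first argument
    arg2: str                          # text span of the second argument
    sem_type: str                      # one value from a flat set of semantic types
    connective: Optional[str] = None   # None when the relation is implicit

    @property
    def is_explicit(self) -> bool:
        # A relation counts as explicit when a connective is present in the text.
        return self.connective is not None


# Placeholder values only; real spans and labels would come from the corpus.
explicit = DiscourseRelation(
    arg1="<first argument span>",
    arg2="<second argument span>",
    sem_type="<semantic type>",
    connective="<connective>",
)
implicit = DiscourseRelation(
    arg1="<first argument span>",
    arg2="<second argument span>",
    sem_type="<semantic type>",
)
```

Keeping explicit and implicit relations in one record type mirrors the single-pass annotation the abstract mentions: the two kinds differ only in whether a connective is present.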

Notes

  1. In this and subsequent examples, DE is used to gloss 的, a particle in Chinese that does not have an English equivalent. Similarly, CL is used to gloss Chinese classifiers.

  2. As one reviewer points out, avoiding discontinuous arguments in all cases may not be possible for other languages. For example, for an English sentence like “The little boy, when asked about the game, squealed in delight”, there is no obvious way of avoiding discontinuous arguments. We are not suggesting that discontinuous arguments should be eliminated at all costs. Instead, we argue that it is desirable to minimize the number of discontinuous arguments when this is warranted by the linguistic facts of the language.

  3. This data will be released via the LDC.

  4. www.seas.upenn.edu/pdtb.

  5. There are instances where a punctuation mark does not exist, e.g., between a section title and the paragraph that follows the section title. In this case the line break is treated as the punctuation mark and it signals the existence of a discourse relation.

  6. Following the practice of the PDTB, when computing the agreement, we lump the finer-grained semantic types together into the coarser-grained semantic classes before performing the computation.
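The collapsing step described in note 6 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the type and class names in the mapping are placeholders, and plain observed agreement (the fraction of matching labels) is used only as a stand-in, since the note does not say which agreement measure is computed.

```python
from typing import Dict, List

# Hypothetical mapping from finer-grained semantic types to coarser classes.
# These names are placeholders, not labels used in the corpus.
TYPE_TO_CLASS: Dict[str, str] = {
    "type_a1": "CLASS_A",
    "type_a2": "CLASS_A",
    "type_b1": "CLASS_B",
    "type_c1": "CLASS_C",
}


def collapse(labels: List[str]) -> List[str]:
    """Replace each fine-grained type with its coarse class."""
    return [TYPE_TO_CLASS[label] for label in labels]


def observed_agreement(ann1: List[str], ann2: List[str]) -> float:
    """Fraction of items on which two annotators agree after collapsing."""
    if len(ann1) != len(ann2) or not ann1:
        raise ValueError("annotations must be non-empty and of equal length")
    coarse1, coarse2 = collapse(ann1), collapse(ann2)
    matches = sum(a == b for a, b in zip(coarse1, coarse2))
    return matches / len(coarse1)


annotator_1 = ["type_a1", "type_b1", "type_c1", "type_a2"]
annotator_2 = ["type_a2", "type_b1", "type_a1", "type_a2"]
score = observed_agreement(annotator_1, annotator_2)
```

In this toy data the annotators match on two of four items at the fine-grained level and on three of four after collapsing, which shows why agreement computed on coarse classes is typically at least as high as agreement on fine types.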

Acknowledgments

This work is supported by the IIS Division of the National Science Foundation via Grant No. 0910532 entitled “Richer Representations for Machine Translation” and by the CNS Division via Grant No. 0855184 entitled “Building a community resource for temporal inference in Chinese”. All views expressed in this paper are those of the authors and do not necessarily represent the view of the National Science Foundation. We would like to thank Jill Lu and Jennifer Zhang for their help with the annotation effort.

Author information

Corresponding author

Correspondence to Nianwen Xue.

About this article

Cite this article

Zhou, Y., Xue, N. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations. Lang Resources & Evaluation 49, 397–431 (2015). https://doi.org/10.1007/s10579-014-9290-3

  • DOI: https://doi.org/10.1007/s10579-014-9290-3