Abstract
The paper presents the Chinese Discourse TreeBank, a corpus annotated with Penn Discourse TreeBank style discourse relations that take the form of a predicate taking two arguments. We first characterize the syntactic and statistical distributions of Chinese discourse connectives as well as the role of Chinese punctuation marks in discourse annotation, and then describe how we design our annotation strategy based on this characterization. The Chinese-specific features of our annotation strategy include annotating explicit and implicit discourse relations in a single pass, defining the argument labels on semantic, rather than syntactic, grounds, and annotating the semantic type of implicit discourse relations directly. We also introduce a flat, 11-valued semantic type classification scheme for discourse relations. Finally, we demonstrate the feasibility of our approach with evaluation results.

Notes
In this and subsequent examples, DE is used to gloss a particle in Chinese that does not have an English equivalent. Similarly, CL is used to gloss Chinese classifiers.
As one reviewer points out, avoiding discontinuous arguments in all cases may not be possible for other languages. For example, for an English sentence like “The little boy, when asked about the game, squealed in delight”, there is no obvious way of avoiding discontinuous arguments. We are not suggesting that discontinuous arguments should be eliminated at all costs. Instead, we argue that it is desirable to minimize the number of discontinuous arguments when this is warranted by the linguistic facts of the language.
This data will be released via the LDC.
There are instances where a punctuation mark does not exist, e.g., between a section title and the paragraph that follows the section title. In this case the line break is treated as the punctuation mark and it signals the existence of a discourse relation.
Following the practice of the PDTB, when computing the agreement, we lump the finer-grained semantic types together into the coarser-grained semantic classes before performing the computation.
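The collapsing step described in the last note can be sketched as follows. This is a minimal illustration only: the label names, the fine-to-coarse mapping, and the use of simple observed (percentage) agreement are all assumptions made for the example, not the scheme or metric used in the paper.

```python
from typing import Dict, List

# Hypothetical fine-to-coarse mapping, for illustration only.
# It is NOT the 11-valued scheme introduced in the paper.
FINE_TO_COARSE: Dict[str, str] = {
    "cause": "contingency",
    "condition": "contingency",
    "contrast": "comparison",
    "concession": "comparison",
}


def coarsen(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each fine-grained type with its coarse-grained class."""
    return [mapping[label] for label in labels]


def observed_agreement(a: List[str], b: List[str]) -> float:
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b):
        raise ValueError("annotation lists must have the same length")
    if not a:
        raise ValueError("annotation lists must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Made-up annotations from two hypothetical annotators.
annotator_1 = ["cause", "contrast", "condition", "concession"]
annotator_2 = ["condition", "contrast", "condition", "contrast"]

# Agreement on the fine-grained labels.
fine_score = observed_agreement(annotator_1, annotator_2)

# Agreement after lumping fine-grained types into coarse classes.
coarse_score = observed_agreement(
    coarsen(annotator_1, FINE_TO_COARSE),
    coarsen(annotator_2, FINE_TO_COARSE),
)

print(fine_score, coarse_score)
```

In this toy data the annotators disagree on two fine-grained types that fall within the same coarse class, so agreement rises once the labels are collapsed.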
Acknowledgments
This work is supported by the IIS Division of the National Science Foundation via Grant No. 0910532 entitled “Richer Representations for Machine Translation” and by the CNS Division via Grant No. 0855184 entitled “Building a community resource for temporal inference in Chinese”. All views expressed in this paper are those of the authors and do not necessarily represent the view of the National Science Foundation. We would like to thank Jill Lu and Jennifer Zhang for their help with the annotation effort.
Cite this article
Zhou, Y., Xue, N. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations. Lang Resources & Evaluation 49, 397–431 (2015). https://doi.org/10.1007/s10579-014-9290-3
DOI: https://doi.org/10.1007/s10579-014-9290-3