The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations

Zhou, Yuping; Xue, Nianwen

doi:10.1007/s10579-014-9290-3

Original Paper
Published: 21 November 2014

Volume 49, pages 397–431, (2015)
Language Resources and Evaluation

Yuping Zhou¹ &
Nianwen Xue¹

Abstract

The paper presents the Chinese Discourse TreeBank, a corpus annotated with Penn Discourse TreeBank-style discourse relations, each of which takes the form of a predicate with two arguments. We first characterize the syntactic and statistical distributions of Chinese discourse connectives as well as the role of Chinese punctuation marks in discourse annotation, and then describe how we design our annotation strategy based on this characterization. The Chinese-specific features of our annotation strategy include annotating explicit and implicit discourse relations in a single pass, defining the argument labels on semantic, rather than syntactic, grounds, and annotating the semantic type of implicit discourse relations directly. We also introduce a flat, 11-valued semantic type classification scheme for discourse relations. Finally, we demonstrate the feasibility of our approach with evaluation results.

Notes

In this and subsequent examples, DE is used to gloss 的, a particle in Chinese that does not have an English equivalent. Similarly, CL is used to gloss Chinese classifiers.
As one reviewer points out, avoiding discontinuous arguments in all cases may not be possible for other languages. For example, for an English sentence like “The little boy, when asked about the game, squealed in delight”, there is no obvious way of avoiding discontinuous arguments. We are not suggesting that discontinuous arguments should be eliminated at all cost. Instead, we argue that it is desirable to minimize the number of discontinuous arguments when this is warranted by the linguistic facts of the language.
This data will be released via the LDC.
www.seas.upenn.edu/pdtb.
There are instances where a punctuation mark does not exist, e.g., between a section title and the paragraph that follows the section title. In this case the line break is treated as the punctuation mark and it signals the existence of a discourse relation.
Following the practice of the PDTB, when computing the agreement, we lump the finer-grained semantic types together into the coarser-grained semantic classes before performing the computation.
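
The class-level lumping described in the last note can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the mapping `COARSE_CLASS` uses PDTB-style label names chosen only for illustration (they are not the paper's 11-type inventory), and plain observed agreement is an assumed stand-in for whatever agreement measure the paper reports.

```python
from typing import Dict, List

# Hypothetical fine-to-coarse mapping, for illustration only. The labels are
# PDTB-style names, not the actual type inventory used in the paper.
COARSE_CLASS: Dict[str, str] = {
    "Cause": "Contingency",
    "Condition": "Contingency",
    "Contrast": "Comparison",
    "Concession": "Comparison",
    "Conjunction": "Expansion",
    "Synchronous": "Temporal",
}


def coarse_agreement(ann_a: List[str], ann_b: List[str]) -> float:
    """Fraction of items whose two labels fall in the same coarse class."""
    if len(ann_a) != len(ann_b):
        raise ValueError("both annotators must label the same items")
    if not ann_a:
        raise ValueError("no items to compare")
    # Map each fine-grained label to its class before comparing.
    matches = sum(
        COARSE_CLASS[a] == COARSE_CLASS[b] for a, b in zip(ann_a, ann_b)
    )
    return matches / len(ann_a)


# Toy labels from two hypothetical annotators over the same four relations.
a = ["Cause", "Contrast", "Conjunction", "Synchronous"]
b = ["Condition", "Concession", "Conjunction", "Cause"]
print(coarse_agreement(a, b))
```

Under this toy mapping the two annotators agree on three of the four items at the class level, although only one of the four fine-grained labels matches exactly, which is why lumping before the computation raises the reported agreement.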

Acknowledgments

This work is supported by the IIS Division of the National Science Foundation via Grant No. 0910532 entitled “Richer Representations for Machine Translation” and by the CNS Division via Grant No. 0855184 entitled “Building a community resource for temporal inference in Chinese”. All views expressed in this paper are those of the authors and do not necessarily represent the view of the National Science Foundation. We would like to thank Jill Lu and Jennifer Zhang for their help with the annotation effort.

Author information

Authors and Affiliations

Brandeis University, Waltham, MA, USA
Yuping Zhou & Nianwen Xue

Corresponding author

Correspondence to Nianwen Xue.

About this article

Cite this article

Zhou, Y., Xue, N. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations. Lang Resources & Evaluation 49, 397–431 (2015). https://doi.org/10.1007/s10579-014-9290-3

Published: 21 November 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10579-014-9290-3
