Abstract
This paper reports on our research to build a large-scale Tsinghua Chinese Treebank (TCT). We propose a two-stage approach to reduce manual proofreading labor as much as possible. Inserting an intermediate functional chunk level creates an information bridge that links simple chunk annotation with detailed syntactic tree annotation. We describe our chunk and tree annotation schemes, focusing on two grammatical relation tag sets designed to give a more detailed description of most of the special language phenomena in Chinese. We also briefly introduce our current progress in building a Chinese chunk bank of 2,000,000 Chinese characters, developing an efficient Chinese chunk-based parser, and building a 1,000,000-word Chinese treebank. All this work lays a good foundation for a further research project to build a good Chinese parser.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, Q. (2003). Build a Large-Scale Syntactically Annotated Chinese Corpus. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2003. Lecture Notes in Computer Science, vol 2807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39398-6_15
DOI: https://doi.org/10.1007/978-3-540-39398-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20024-6
Online ISBN: 978-3-540-39398-6
eBook Packages: Springer Book Archive