Build a Large-Scale Syntactically Annotated Chinese Corpus

  • Conference paper
Text, Speech and Dialogue (TSD 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2807)

Abstract

This paper reports on our research to build the large-scale Tsinghua Chinese Treebank (TCT). We propose a two-stage approach that reduces manual proofreading labor as much as possible. Inserting an intermediate functional chunk level creates an information bridge that links simple chunk annotation with detailed syntactic tree annotation. We describe our chunk and tree annotation schemes, focusing on two grammatical relation tag sets designed to give a more detailed description of most of the special language phenomena in Chinese. We also briefly introduce our current progress in building a Chinese chunk bank of 2,000,000 Chinese characters, developing an efficient chunk-based Chinese parser, and building a 1,000,000-word Chinese treebank. This work lays a good foundation for a further research project to build a good Chinese parser.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, Q. (2003). Build a Large-Scale Syntactically Annotated Chinese Corpus. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2003. Lecture Notes in Computer Science, vol 2807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39398-6_15

  • DOI: https://doi.org/10.1007/978-3-540-39398-6_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20024-6

  • Online ISBN: 978-3-540-39398-6

  • eBook Packages: Springer Book Archive
